# Introduction

Let us see if we can estimate the parameter for population growth using MCMC in the example in which we used Kalman filtering.

We recall the model.

\displaystyle \begin{aligned} \dot{p} & = rp\Big(1 - \frac{p}{k}\Big) \end{aligned}

$\displaystyle p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}$

And we are allowed to sample at regular intervals

\displaystyle \begin{aligned} p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned}

In other words $y_i \sim {\cal{N}}(p_i, \sigma^2)$, where $\sigma$ is known so the likelihood is

$\displaystyle p(y\,|\,r) \propto \prod_{i=1}^n \exp{\bigg( -\frac{(y_i - p_i)^2}{2\sigma^2}\bigg)} = \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}$

Let us assume a prior of $r \sim {\cal{N}}(\mu_0,\sigma_0^2)$ then the posterior becomes

$\displaystyle p(r\,|\,y) \propto \exp{\bigg( -\frac{(r - \mu_0)^2}{2\sigma_0^2} \bigg)} \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}$

## Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module PopGrowthMCMC where
>
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )

# Implementation

We assume most of the parameters are known with the exception of the the growth rate $r$. We fix this also in order to generate test data.

> k, p0 :: Double
> k = 1.0
> p0 = 0.1
> r, deltaT :: Double
> r = 10.0
> deltaT = 0.0005
> nObs :: Int
> nObs = 150

Here’s the implementation of the logistic function

> logit :: Double -> Double -> Double -> Double
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))

Let us create some noisy data.

> sigma :: Double
> sigma = 1e-2
> samples :: [Double]
> samples = zipWith (+) mus epsilons
>   where
>     mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
>     epsilons = evalState (sample $replicateM nObs$ rvar (Normal 0.0 sigma)) (pureMT 3)

Arbitrarily let us set the prior to have a rather vague normal distribution.

> mu0, sigma0 :: Double
> mu0 = 5.0
> sigma0 = 1e1
> prior :: Double -> Double
> prior r = exp (-(r - mu0)**2 / (2 * sigma0**2))
> likelihood :: Double -> [Double] -> Double
> likelihood r ys = exp (-sum (zipWith (\y mu -> (y - mu)**2 / (2 * sigma**2)) ys mus))
>   where
>     mus :: [Double]
>     mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
> posterior :: Double -> [Double] -> Double
> posterior r ys = likelihood r ys * prior r

The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.

> acceptanceProb' :: Double -> Double -> [Double] -> Double
> acceptanceProb' r r' ys = min 1.0 ((posterior r' ys) / (posterior r ys))
> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (r, nAccs) (proposedJump, acceptOrReject) =
>   if acceptOrReject < acceptanceProb' r (r + proposedJump) samples
>   then (r + proposedJump, nAccs + 1)
>   else (r, nAccs)

Here are our proposals.

> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
>   evalState (replicateM nIters (sample (Normal 0.0 sigma)))
>   (pureMT $fromIntegral seed) We also need samples from the uniform distribution > acceptOrRejects :: Int -> Int -> [Double] > acceptOrRejects seed nIters = > evalState (replicateM nIters (sample stdUniform)) > (pureMT$ fromIntegral seed)

Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning.

> nIters, burnIn :: Int
> nIters = 100000
> burnIn = nIters div 10

Let us start our chain to the mean of the prior. In theory this shoudn’t matter as by the time we have burnt in we should be sampling in the high density region of the distribution.

> startMu :: Double
> startMu = 5.0

This jump size should allow us to sample the region of high density at a reasonable granularity.

> jumpVar :: Double
> jumpVar = 0.01

Now we can test our MCMC implementation.

> test :: Int -> [(Double, Int)]
> test seed =
>   drop burnIn $> scanl oneStep (startMu, 0)$
>   zip (normalisedProposals seed jumpVar nIters)
>       (acceptOrRejects seed nIters)

We put the data into a histogram.

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = r - 0.5 * sigma0
>     upper = r + 0.5 * sigma0
>
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $test seed) With 50 observations we don’t seem to be very certain about the growth rate. With 100 observations we do very much better. And with 150 observations we do even better. # The Flow of the Thames: An Autoregressive Model # Thames Flux It is roughly 150 miles from the source of the Thames to Kingston Bridge. If we assume that it flows at about 2 miles per hour then the water at Thames Head will have reached Kingston very roughly at $\frac{150}{24\times 2} \approxeq 3$ days. The Environmental Agency measure the flux at Kingston Bridge on a twice daily basis. Can we predict this? In the first instance without any other data and using our observation that Thames flushes itself every 3 days, let us try $\displaystyle X_t = \theta_1 X_{t-1} + \theta_2 X_{t-2} + \theta_3 X_{t-3} + \epsilon_t$ where $X_t$ is the flux on day $t$ and $\{\epsilon_t\}_{t \in \mathbb{N}}$ are independent normal errors with mean $0$ and variance some given value $\sigma^2$. # Kalman As it stands, our model is not Markov so we cannot directly apply techniques such as Kalman filtering or particle filtering to estimate the parameters. However we can re-write the model as $\displaystyle \begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \theta_3^{(t)} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^{(t-1)} \\ \theta_2^{(t-1)} \\ \theta_3^{(t-1)} \end{bmatrix} + \begin{bmatrix} \eta_{t} \\ \eta_{t} \\ \eta_{t} \end{bmatrix}$ $\displaystyle y_t = \begin{bmatrix} x_{t-1} & x_{t-2} & x_{t-3} \end{bmatrix} \begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \theta_3^{(t)} \end{bmatrix} + \epsilon_{t}$ Note that the observation map now varies over time so we have modify our Kalman filter implementation to accept a different matrix at each step. > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-} > {-# LANGUAGE DataKinds #-} > {-# LANGUAGE ScopedTypeVariables #-} > {-# LANGUAGE RankNTypes #-} > {-# LANGUAGE TypeOperators #-} > {-# LANGUAGE TypeFamilies #-} > module Autoregression ( > predictions > ) where > import GHC.TypeLits > import Numeric.LinearAlgebra.Static > import Data.Maybe ( fromJust ) > import qualified Data.Vector as V > inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n > inv m = fromJust$ linSolve m eye
> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n -> [L m n] -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer muPrior sigmaPrior bigHs bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) (zip ys bigHs)
>
>     update :: (R n, Sq n) -> (R m, L m n) -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) (y, bigH) =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         v :: R m
>         v = y - bigH #> xHatFlat
>         bigS :: Sq m
>         bigS = bigH  bigSigmaHatFlat  (tr bigH) + bigSigmaY
>         bigK :: L n m
>         bigK = bigSigmaHatFlat  (tr bigH)  (inv bigS)
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK  bigS  (tr bigK)
>         xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA  bigSigmaHat  (tr bigA) + bigSigmaX

We can now set up the parameters to run the filter.

> stateVariance :: Double
> stateVariance = 1e-8
> bigSigmaX :: Sq 3
> bigSigmaX = fromList [ stateVariance, 0.0,           0.0
>                      , 0.0,           stateVariance, 0.0
>                      , 0.0,           0.0,           stateVariance
>                      ]
> bigA :: Sq 3
> bigA = eye
> muPrior :: R 3
> muPrior = fromList [0.0, 0.0, 0.0]
> sigmaPrior :: Sq 3
> sigmaPrior = fromList [ 1e1, 0.0, 0.0
>                       , 0.0, 1e1, 0.0
>                       , 0.0, 0.0, 1e1
>                       ]
> bigHsBuilder :: V.Vector Double -> [L 1 3]
> bigHsBuilder flows =
>   V.toList $> V.zipWith3 (\x0 x1 x2 -> fromList [x0, x1, x2]) > (V.tail flows) (V.tail$ V.tail flows) (V.tail $V.tail$ V.tail flows)
> obsVariance :: Double
> obsVariance = 1.0e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [ obsVariance ]
> predict :: R 3 -> Double -> Double -> Double -> Double
> predict theta f1 f2 f3 = h1 * f1 + h2 * f2 + h3 * f3
>   where
>     (h1, t1) = headTail theta
>     (h2, t2) = headTail t1
>     (h3, _)  = headTail t2
> thetas :: V.Vector Double -> [(R 3, Sq 3)]
> thetas flows = outer muPrior sigmaPrior (bigHsBuilder flows)
>                bigSigmaY bigA bigSigmaX (map (fromList . return) (V.toList flows))
> predictions :: V.Vector Double -> V.Vector Double
> predictions flows =
>   V.zipWith4 predict
>   (V.fromList $map fst (thetas flows)) > flows (V.tail flows) (V.tail$ V.tail flows)

# How Good is Our Model?

If we assume that parameters are essentially fixed by taking the state variance to be e.g. $10^{-8}$ then the fit is not good.

However, if we assume the parameters to undergo Brownian motion by taking the state variance to be e.g. $10^{-2}$ then we get a much better fit. Of course, Brownian motion is probably not a good way of modelling the parameters; we hardly expect that these could wander off to infinity.

# Introduction

Suppose you want to sample from the truncated normal distribution. One way to do this is to use rejection sampling. But if you do this naïvely then you will run into performance problems. The excellent Devroye (1986) who references Marsaglia (1964) gives an efficient rejection sampling scheme using the Rayleigh distribution. The random-fu package uses the Exponential distribution.

# Performance

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE FlexibleContexts             #-}
> import Data.Random
> import qualified Data.Random.Distribution.Normal as N
> import Data.Random.Source.PureMT

Here’s the naïve implementation.

> naiveReject :: Double -> RVar Double
> naiveReject x = doit
>   where
>     doit = do
>       y <- N.stdNormal
>       if y < x
>         then doit
>         else return y

And here’s an implementation using random-fu.

> expReject :: Double -> RVar Double
> expReject x = N.normalTail x

Let’s try running both of them

> n :: Int
> n = 10000000
> lower :: Double
> lower = 2.0
> testExp :: [Double]
> testExp = evalState (replicateM n $sample (expReject lower)) (pureMT 3) > testNaive :: [Double] > testNaive = evalState (replicateM n$ sample (naiveReject lower)) (pureMT 3)
> main :: IO ()
> main = do
>   print $sum testExp > print$ sum testNaive

Let’s try building and running both the naïve and better tuned versions.

ghc -O2 CompareRejects.hs

As we can see below we get 59.98s and 4.28s, a compelling reason to use the tuned version. And the difference in performance will get worse the less of the tail we wish to sample from.

## Tuned

2.3731610476911187e7
11,934,195,432 bytes allocated in the heap
5,257,328 bytes copied during GC
44,312 bytes maximum residency (2 sample(s))
21,224 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0     23145 colls,     0 par    0.09s    0.11s     0.0000s    0.0001s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    4.19s  (  4.26s elapsed)
GC      time    0.09s  (  0.11s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    4.28s  (  4.37s elapsed)

%GC     time       2.2%  (2.6% elapsed)

Alloc rate    2,851,397,967 bytes per MUT second

Productivity  97.8% of total user, 95.7% of total elapsed

## Naïve

2.3732073159369867e7
260,450,762,656 bytes allocated in the heap
111,891,960 bytes copied during GC
85,536 bytes maximum residency (2 sample(s))
76,112 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0     512768 colls,     0 par    1.86s    2.24s     0.0000s    0.0008s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time   58.12s  ( 58.99s elapsed)
GC      time    1.86s  (  2.24s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    59.98s  ( 61.23s elapsed)

%GC     time       3.1%  (3.7% elapsed)

Alloc rate    4,481,408,869 bytes per MUT second

Productivity  96.9% of total user, 94.9% of total elapsed

# Bibliography

Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag. http://books.google.co.uk/books?id=mEw\_AQAAIAAJ.

Marsaglia, G. 1964. “Generating a Variable from the Tail of the Normal Distribution.” J-Technometrics 6 (1): 101–2.

# Introduction

Simple models for e.g. financial option pricing assume that the volatility of an index or a stock is constant, see here for example. However, simple observation of time series show that this is not the case; if it were then the log returns would be white noise

One approach which addresses this, GARCH (Generalised AutoRegressive Conditional Heteroskedasticity), models the evolution of volatility deterministically.

Stochastic volatility models treat the volatility of a return on an asset, such as an option to buy a security, as a Hidden Markov Model (HMM). Typically, the observable data consist of noisy mean-corrected returns on an underlying asset at equally spaced time points.

There is evidence that Stochastic Volatility models (Kim, Shephard, and Chib (1998)) offer increased flexibility over the GARCH family, e.g. see Geweke (1994), Fridman and Harris (1998) and Jacquier, Polson, and Rossi (1994). Despite this and judging by the numbers of questions on the R Special Interest Group on Finance mailing list, the use of GARCH in practice far outweighs that of Stochastic Volatility. Reasons cited are the multiplicity of estimation methods for the latter and the lack of packages (but see here for a recent improvement to the paucity of packages).

In their tutorial on particle filtering, Doucet and Johansen (2011) give an example of stochastic volatility. We save this approach for future blog posts and follow Lopes and Polson and the excellent lecture notes by Hedibert Lopes.

Here’s the model.

\displaystyle \begin{aligned} H_0 &\sim {\mathcal{N}}\left( m_0, C_0\right) \\ H_t &= \mu + \phi H_{t-1} + \tau \eta_t \\ Y_n &= \beta \exp(H_t / 2) \epsilon_n \\ \end{aligned}

We wish to estimate $\mu, \phi, \tau$ and $\boldsymbol{h}$. To do this via a Gibbs sampler we need to sample from

$\displaystyle {p \left( \mu, \phi, \tau \,\vert\, \boldsymbol{h}, \boldsymbol{y} \right)} \quad \text{and} \quad {p \left( \boldsymbol{h} \,\vert\, \mu, \phi, \tau, \boldsymbol{y} \right)}$

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE RecursiveDo                   #-}
> {-# LANGUAGE ExplicitForAll                #-}
> {-# LANGUAGE TypeOperators                 #-}
> {-# LANGUAGE TypeFamilies                  #-}
> {-# LANGUAGE ScopedTypeVariables           #-}
> {-# LANGUAGE DataKinds                     #-}
> {-# LANGUAGE FlexibleContexts              #-}
> module StochVol (
>     bigM
>   , bigM0
>   , runMC
>   , ys
>   , vols
>   , expectationTau2
>   , varianceTau2
>   ) where
> import Numeric.LinearAlgebra.HMatrix hiding ( (===), (|||), Element,
>                                               (<>), (#>), inv )
> import qualified Numeric.LinearAlgebra.Static as S
> import Numeric.LinearAlgebra.Static ( (<>) )
> import GHC.TypeLits
> import Data.Proxy
> import Data.Maybe ( fromJust )
> import Data.Random
> import Data.Random.Source.PureMT
> import Control.Monad.Writer hiding ( (<>) )
> import Control.Applicative
> import qualified Data.Vector as V
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => S.Sq n -> S.Sq n
> inv m = fromJust S.linSolve m S.eye > infixr 8 #> > (#>) :: (KnownNat m, KnownNat n) => S.L m n -> S.R n -> S.R m > (#>) = (S.#>) > type StatsM a = RVarT (Writer [((Double, Double), Double)]) a > (|||) :: (KnownNat ((+) r1 r2), KnownNat r2, KnownNat c, KnownNat r1) => > S.L c r1 -> S.L c r2 -> S.L c ((+) r1 r2) > (|||) = (S.¦) # Marginal Distribution for Parameters Let us take a prior that is standard for linear regression $\displaystyle (\boldsymbol{\theta}, \tau^2) \sim {\mathcal{NIG}}(\boldsymbol{\theta}_0, V_0, \nu_0, s_0^2)$ where $\boldsymbol{\theta} = (\mu, \phi)^\top$ and use standard results for linear regression to obtain the required marginal distribution. That the prior is Normal Inverse Gamma (${\cal{NIG}}$) means \displaystyle \begin{aligned} \boldsymbol{\theta} \, | \, \tau^2 & \sim {\cal{N}}(\boldsymbol{\theta}_0, \tau^2 V_0) \\ \tau^2 & \sim {\cal{IG}}(\nu_0 / 2, \nu_0 s_0^2 / 2) \end{aligned} Standard Bayesian analysis for regression tells us that the (conditional) posterior distribution for $\displaystyle y_i = \beta + \alpha x_i + \epsilon_i$ where the $\{\epsilon_i\}$ are IID normal with variance $\sigma^2$ is given by $\displaystyle {p \left( \alpha, \beta, \eta \,\vert\, \boldsymbol{y}, \boldsymbol{x} \right)} = {\cal{N}}((\alpha, \beta); \mu_n, \sigma^2\Lambda_n^{-1})\,{\cal{IG}}(a_n, b_n)$ with $\displaystyle \Lambda_n = X_n^\top X_n + \Lambda_0$ $\displaystyle \begin{matrix} \mu_n = \Lambda_n^{-1}({X_n}^{\top}{X_n}\hat{\boldsymbol{\beta}}_n + \Lambda_0\mu_0) & \textrm{where} & \hat{\boldsymbol\beta}_n = ({X}_n^{\rm T}{X}_n)^{-1}{X}_n^{\rm T}\boldsymbol{y}_n \end{matrix}$ $\displaystyle \begin{matrix} a_n = \frac{n}{2} + a_0 & \quad & b_n = b_0 + \frac{1}{2}(\boldsymbol{y}^\top\boldsymbol{y} + \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n) \end{matrix}$ ## Recursive Form We can re-write the above recursively. We do not need to for this blog article but it will be required in any future blog article which uses Sequential Monte Carlo techniques. $\displaystyle \Lambda_n = \boldsymbol{x}_n^\top \boldsymbol{x}_n + \Lambda_{n-1}$ Furthermore $\displaystyle \Lambda_{n}\mu_{n} = {X}_{n}^{\rm T}\boldsymbol{y}_{n} + \Lambda_0\mu_0 = {X}_{n-1}^{\rm T}\boldsymbol{y}_{n-1} + \boldsymbol{x}_n^\top y_n + \Lambda_0\mu_0 = \Lambda_{n-1}\mu_{n-1} + \boldsymbol{x}_n^\top y_n$ so we can write $\displaystyle \boldsymbol{\mu}_n = \Lambda_n^{-1}(\Lambda_{n-1}\mu_{n-1} + \boldsymbol{x}_n^\top y_n)$ and $\displaystyle \begin{matrix} a_n = a_{n-1} + \frac{1}{2} & \quad & b_n = b_{n-1} + \frac{1}{2}\big[(y_n - \boldsymbol{\mu}_n^\top \boldsymbol{x}_n)y_n + (\boldsymbol{\mu}_{n-1} - \boldsymbol{\mu}_{n})^\top \Lambda_{n-1}\boldsymbol{\mu}_{n-1}\big] \end{matrix}$ ## Specialising In the case of our model we can specialise the non-recursive equations as $\displaystyle \Lambda_n = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \ldots & \ldots \\ 1 & x_n \end{bmatrix} + \Lambda_0$ $\displaystyle \begin{matrix} \mu_n = (\Lambda_n)^{-1}({X_n}^{\top}{X_n}\hat{\boldsymbol{\beta}}_n + \Lambda_0\mu_0) & \textrm{where} & \hat{\boldsymbol\beta}_n = ({X}_n^{\rm T}{X}_n)^{-1}{X}_n^{\rm T}\boldsymbol{x}_{1:n} \end{matrix}$ $\displaystyle \begin{matrix} a_n = \frac{n}{2} + a_0 & \quad & b_n = b_0 + \frac{1}{2}(\boldsymbol{x}_{1:n}^\top\boldsymbol{x}_{1:n} + \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n) \end{matrix}$ Let’s re-write the notation to fit our model. $\displaystyle \Lambda_n = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ h_1 & h_2 & \ldots & h_n \end{bmatrix} \begin{bmatrix} 1 & h_1 \\ 1 & h_2 \\ \ldots & \ldots \\ 1 & h_n \end{bmatrix} + \Lambda_0$ $\displaystyle \begin{matrix} \mu_n = (\Lambda_n)^{-1}({H_n}^{\top}{H_n}\hat{\boldsymbol{\theta}}_n + \Lambda_0\mu_0) & \textrm{where} & \hat{\boldsymbol\theta}_n = ({H}_n^{\rm T}{H}_n)^{-1}{H}_n^{\rm T}\boldsymbol{x}_{1:n} \end{matrix}$ $\displaystyle \begin{matrix} a_n = \frac{n}{2} + a_0 & \quad & b_n = b_0 + \frac{1}{2}(\boldsymbol{x}_{1:n}^\top\boldsymbol{x}_{1:n} + \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n) \end{matrix}$ Sample from ${p \left( \boldsymbol{\theta}, \tau^2 \,\vert\, \boldsymbol{h}, \boldsymbol{y} \right)} \sim {\mathcal{NIG}}(\boldsymbol{\theta}_1, V_1, \nu_1, s_1^2)$ We can implement this in Haskell as > sampleParms :: > forall n m . > (KnownNat n, (1 <=? n) ~ 'True) => > S.R n -> S.L n 2 -> S.R 2 -> S.Sq 2 -> Double -> Double -> > RVarT m (S.R 2, Double) > sampleParms y bigX theta_0 bigLambda_0 a_0 s_02 = do > let n = natVal (Proxy :: Proxy n) > a_n = 0.5 * (a_0 + fromIntegral n) > bigLambda_n = bigLambda_0 + (tr bigX) <> bigX > invBigLambda_n = inv bigLambda_n > theta_n = invBigLambda_n #> ((tr bigX) #> y + (tr bigLambda_0) #> theta_0) > b_0 = 0.5 * a_0 * s_02 > b_n = b_0 + > 0.5 * (S.extract (S.row y <> S.col y)!0!0) + > 0.5 * (S.extract (S.row theta_0 <> bigLambda_0 <> S.col theta_0)!0!0) - > 0.5 * (S.extract (S.row theta_n <> bigLambda_n <> S.col theta_n)!0!0) > g <- rvarT (Gamma a_n (recip b_n)) > let s2 = recip g > invBigLambda_n' = m <> invBigLambda_n > where > m = S.diag S.vector (replicate 2 s2)
>   m1 <- rvarT StdNormal
>   m2 <- rvarT StdNormal
>   let theta_n' :: S.R 2
>       theta_n' = theta_n + S.chol (S.sym invBigLambda_n') #> (S.vector [m1, m2])
>   return (theta_n', s2)

# Marginal Distribution for State

## Marginal for $H_0$

Using a standard result about conjugate priors and since we have

$\displaystyle h_0 \sim {\cal{N}}(m_0,C_0) \quad h_1 \vert h_0 \sim {\cal{N}}(\mu + \phi h_0, \tau^2)$

we can deduce

$\displaystyle h_0 \vert h_1 \sim {\cal{N}}(m_1,C_1)$

where

\displaystyle \begin{aligned} \frac{1}{C_1} &= \frac{1}{C_0} + \frac{\phi^2}{\tau^2} \\ \frac{m_1}{C_1} &= \frac{m_0}{C_0} + \frac{\phi(h_1 - \mu)}{\tau^2} \end{aligned}

> sampleH0 :: Double ->
>             Double ->
>             V.Vector Double ->
>             Double ->
>             Double ->
>             Double ->
>             RVarT m Double
> sampleH0 iC0 iC0m0 hs mu phi tau2 = do
>   let var = recip (iC0 + phi^2 / tau2) > mean = var * (iC0m0 + phi * ((hs V.! 0) - mu) / tau2) > rvarT (Normal mean (sqrt var)) ## Marginal for $H_1 \ldots H_n$ From the state equation, we have \displaystyle \begin{aligned} H_{t+1} &= \mu + \phi H_{t} + \tau \eta_t \\ \phi^2 H_{t} &= -\phi\mu + \phi H_{t+1} - \phi \tau \eta_t \\ \end{aligned} We also have \displaystyle \begin{aligned} H_{t} &= \mu + \phi H_{t-1} + \tau \eta_{t-1} \\ \end{aligned} Adding the two expressions together gives \displaystyle \begin{aligned} (1 + \phi^2)H_{t} &= \phi (H_{t-1} + H_{t+1}) + \mu (1 - \phi) + \tau(\eta_{t-1} - \phi\eta_t) \\ H_{t} &= \frac{\phi}{1 + \phi^2} (H_{t-1} + H_{t+1}) + \mu \frac{1 - \phi}{1 + \phi^2} + \frac{\tau}{1 + \phi^2}(\eta_{t-1} - \phi\eta_t) \\ \end{aligned} Since $\{\eta_t\}$ are standard normal, then $H_t$ conditional on $H_{t-1}$ and $H_{t+1}$ is normally distributed, and \displaystyle \begin{aligned} \mathbb{E}(H_n\mid H_{n-1}, H_{n+1}) &= \frac{1 - \phi}{1+\phi^2}\mu + \frac{\phi}{1+\phi^2}(H_{n-1} + H_{n+1}) \\ \mathbb{V}(H_n\mid H_{n-1}, H_{n+1}) &= \frac{\tau^2}{1+\phi^2} \end{aligned} We also have $\displaystyle h_{n+1} \vert h_n \sim {\cal{N}}(\mu + \phi h_n, \tau^2)$ Writing $\displaystyle \boldsymbol{h}_{-t} = \begin{bmatrix} h_0, & h_1, & \ldots, & h_{t-1}, & h_{t+1}, & \ldots, & h_{n-1}, & h_n \end{bmatrix}$ by Bayes’ Theorem we have $\displaystyle {p \left( h_t \,\vert\, \boldsymbol{h}_{-t}, \theta, \boldsymbol{y} \right)} \propto {p \left( y_t \,\vert\, h_t \right)} {p \left( h_t \,\vert\, \boldsymbol{h}_{-t}, \theta, y_{-t} \right)} = f_{\cal{N}}(y_t;0,e^{h_t}) f_{\cal{N}}(h_t;\mu_t,\nu_t^2)$ where $f_{\cal{N}}(x;\mu,\sigma^2)$ is the probability density function of a normal distribution. We can sample from this using Metropolis 1. For each $t$, sample $h_t^\flat$ from ${\cal{N}}(h_t, \gamma^2)$ where $\gamma^2$ is the tuning variance. 2. For each $t=1, \ldots, n$, compute the acceptance probability $\displaystyle p_t = \min{\Bigg(\frac{f_{\cal{N}}(h^\flat_t;\mu_t,\nu_t^2) f_{\cal{N}}(y_t;0,e^{h^\flat_t})}{f_{\cal{N}}(h_t;\mu_t,\nu_t^2) f_{\cal{N}}(y_t;0,e^{h_t})}, 1 \Bigg)}$ 1. For each $t$, compute a new value of $h_t$ $\displaystyle h^\sharp_t = \begin{cases} h^\flat_t \text{with probability } p_t \\ h_t \text{with probability } 1 - p_t \end{cases}$ > metropolis :: V.Vector Double -> > Double -> > Double -> > Double -> > Double -> > V.Vector Double -> > Double -> > RVarT m (V.Vector Double) > metropolis ys mu phi tau2 h0 hs vh = do > let eta2s = V.replicate (n-1) (tau2 / (1 + phi^2)) V.snoc tau2 > etas = V.map sqrt eta2s > coef1 = (1 - phi) / (1 + phi^2) * mu > coef2 = phi / (1 + phi^2) > mu_n = mu + phi * (hs V.! (n-1)) > mu_1 = coef1 + coef2 * ((hs V.! 1) + h0) > innerMus = V.zipWith (\hp1 hm1 -> coef1 + coef2 * (hp1 + hm1)) (V.tail (V.tail hs)) hs > mus = mu_1 V.cons innerMus V.snoc mu_n > hs' <- V.mapM (\mu -> rvarT (Normal mu vh)) hs > let num1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs' > num2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs' > nums = V.zipWith (+) num1s num2s > den1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs > den2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs > dens = V.zipWith (+) den1s den2s > us <- V.replicate n <> rvarT StdUniform
>   let ls   = V.zipWith (\n d -> min 0.0 (n - d)) nums dens
>   return $V.zipWith4 (\u l h h' -> if log u < l then h' else h) us ls hs hs' # Markov Chain Monte Carlo Now we can write down a single step for our Gibbs sampler, sampling from each marginal in turn. > singleStep :: Double -> V.Vector Double -> > (Double, Double, Double, Double, V.Vector Double) -> > StatsM (Double, Double, Double, Double, V.Vector Double) > singleStep vh y (mu, phi, tau2, h0, h) = do > lift$ tell [((mu, phi),tau2)]
>   hNew <- metropolis y mu phi tau2 h0 h vh
>   h0New <- sampleH0 iC0 iC0m0 hNew mu phi tau2
>   let bigX' = (S.col $S.vector$ replicate n 1.0)
>               |||
>               (S.col $S.vector$ V.toList $h0New V.cons V.init hNew) > bigX = bigX' asTypeOf (snd$ valAndType nT)
>   newParms <- sampleParms (S.vector $V.toList h) bigX (S.vector [mu0, phi0]) invBigV0 nu0 s02 > return ( (S.extract (fst newParms))!0 > , (S.extract (fst newParms))!1 > , snd newParms > , h0New > , hNew > ) # Testing Let’s create some test data. > mu', phi', tau2', tau' :: Double > mu' = -0.00645 > phi' = 0.99 > tau2' = 0.15^2 > tau' = sqrt tau2' We need to create a statically typed matrix with one dimension the same size as the data so we tie the data size value to the required type. > nT :: Proxy 500 > nT = Proxy > valAndType :: KnownNat n => Proxy n -> (Int, S.L n 2) > valAndType x = (fromIntegral$ natVal x, undefined)
> n :: Int
> n = fst $valAndType nT Arbitrarily let us start the process at > h0 :: Double > h0 = 0.0 We define the process as a stream (aka co-recursively) using the Haskell recursive do construct. It is not necessary to do this but streams are a natural way to think of stochastic processes. > hs, vols, sds, ys :: V.Vector Double > hs = V.fromList$ take n $fst$ runState hsAux (pureMT 1)
>   where
>     hsAux = mdo { x0 <- sample (Normal (mu' + phi' * h0) tau')
>                 ; xs <- mapM (\x -> sample (Normal (mu' + phi' * x) tau')) (x0:xs)
>                 ; return xs
>                 }
> vols = V.map exp hs

We can plot the volatility (which we cannot observe directly).

And we can plot the log returns.

> sds = V.map sqrt vols
> ys = fst $runState ysAux (pureMT 2) > where > ysAux = V.mapM (\sd -> sample (Normal 0.0 sd)) sds We start with a vague prior for $H_0$ > m0, c0 :: Double > m0 = 0.0 > c0 = 100.0 For convenience > iC0, iC0m0 :: Double > iC0 = recip c0 > iC0m0 = iC0 * m0 Rather than really sample from priors for $\mu, \phi$ and $\tau^2$ let us cheat and assume we sampled the simulated values! > mu0, phi0, tau20 :: Double > mu0 = -0.00645 > phi0 = 0.99 > tau20 = 0.15^2 But that we are still very uncertain about them > bigV0, invBigV0 :: S.Sq 2 > bigV0 = S.diag$ S.fromList [100.0, 100.0]
> invBigV0 = inv bigV0
> nu0, s02 :: Double
> nu0    = 10.0
> s02    = (nu0 - 2) / nu0 * tau20

Note that for the inverse gamma this gives

> expectationTau2, varianceTau2 :: Double
> expectationTau2 = (nu0 * s02 / 2) / ((nu0 / 2) - 1)
> varianceTau2 = (nu0 * s02 / 2)^2 / (((nu0 / 2) - 1)^2 * ((nu0 / 2) - 2))
ghci> expectationTau2
2.25e-2

ghci> varianceTau2
1.6874999999999998e-4

## Running the Markov Chain

Tuning parameter

> vh :: Double
> vh = 0.1

The burn-in and sample sizes may be too low for actual estimation but will suffice for a demonstration.

> bigM, bigM0 :: Int
> bigM0 = 2000
> bigM  = 2000
> multiStep :: StatsM (Double, Double, Double, Double, V.Vector Double)
> multiStep = iterateM_ (singleStep vh ys) (mu0, phi0, tau20, h0, vols)
> runMC :: [((Double, Double), Double)]
> runMC = take bigM $drop bigM0$
>         execWriter (evalStateT (sample multiStep) (pureMT 42))

And now we can look at the distributions of our estimates

# Bibliography

Doucet, Arnaud, and Adam M Johansen. 2011. “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.” In Handbook of Nonlinear Filtering. Oxford, UK: Oxford University Press.

Fridman, Moshe, and Lawrence Harris. 1998. “A Maximum Likelihood Approach for Non-Gaussian Stochastic Volatility Models.” Journal of Business & Economic Statistics 16 (3): 284–91.

Geweke, John. 1994. “Bayesian Comparison of Econometric Models.”

Jacquier, Eric, Nicholas G. Polson, and Peter E. Rossi. 1994. “Bayesian Analysis of Stochastic Volatility Models.”

Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65 (3): 361–93. http://ideas.repec.org/a/bla/restud/v65y1998i3p361-93.html.

# Summary

An extended Kalman filter in Haskell using type level literals and automatic differentiation to provide some guarantees of correctness.

# Population Growth

Suppose we wish to model population growth of bees via the logistic equation

\displaystyle \begin{aligned} \dot{p} & = rp\Big(1 - \frac{p}{k}\Big) \end{aligned}

We assume the growth rate $r$ is unknown and drawn from a normal distribution ${\cal{N}}(\mu_r, \sigma_r^2)$ but the carrying capacity $k$ is known and we wish to estimate the growth rate by observing noisy values $y_i$ of the population at discrete times $t_0 = 0, t_1 = \Delta T, t_2 = 2\Delta T, \ldots$. Note that $p_t$ is entirely deterministic and its stochasticity is only as a result of the fact that the unknown parameter of the logistic equation is sampled from a normal distribution (we could for example be observing different colonies of bees and we know from the literature that bee populations obey the logistic equation and each colony will have different growth rates).

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE TypeFamilies                 #-}
> module FunWithKalman3 where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import Data.Random.Source.PureMT
> import Data.Random
> import qualified Control.Monad.Writer as W

## Logistic Equation

The logistic equation is a well known example of a dynamical system which has an analytic solution

$\displaystyle p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}$

> logit :: Floating a => a -> a -> a -> a
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))

We observe a noisy value of population at regular time intervals (where $\Delta T$ is the time interval)

\displaystyle \begin{aligned} p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned}

Using the semi-group property of our dynamical system, we can re-write this as

\displaystyle \begin{aligned} p_i &= \frac{kp_{i-1}\exp r\Delta T}{k + p_{i-1}(\exp r\Delta T - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned}

To convince yourself that this re-formulation is correct, think of the population as starting at $p_0$; after 1 time step it has reached $p_1$ and and after two time steps it has reached $p_2$ and this ought to be the same as the point reached after 1 time step starting at $p_1$, for example

> oneStepFrom0, twoStepsFrom0, oneStepFrom1 :: Double
> oneStepFrom0  = logit 0.1 1.0 (1 * 0.1)
> twoStepsFrom0 = logit 0.1 1.0 (1 * 0.2)
> oneStepFrom1  = logit oneStepFrom0 1.0 (1 * 0.1)
ghci> twoStepsFrom0
0.11949463171139338

ghci> oneStepFrom1
0.1194946317113934

We would like to infer the growth rate not just be able to predict the population so we need to add another variable to our model.

\displaystyle \begin{aligned} r_i &= r_{i-1} \\ p_i &= \frac{kp_{i-1}\exp r_{i-1}\Delta T}{k + p_{i-1}(\exp r_{i-1}\Delta T - 1)} \\ y_i &= \begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}r_i \\ p_i\end{bmatrix} + \begin{bmatrix}0 \\ \epsilon_i\end{bmatrix} \end{aligned}

# Extended Kalman

This is almost in the form suitable for estimation using a Kalman filter but the dependency of the state on the previous state is non-linear. We can modify the Kalman filter to create the extended Kalman filter (EKF) by making a linear approximation.

Since the measurement update is trivially linear (even in this more general form), the measurement update step remains unchanged.

\displaystyle \begin{aligned} \boldsymbol{v}_i & \triangleq \boldsymbol{y}_i - \boldsymbol{g}(\hat{\boldsymbol{x}}^\flat_i) \\ \boldsymbol{S}_i & \triangleq \boldsymbol{G}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{G}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\ \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{G}_i^\top\boldsymbol{S}^{-1}_i \\ \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\ \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i \end{aligned}

By Taylor we have

$\displaystyle \boldsymbol{a}(\boldsymbol{x}) \approx \boldsymbol{a}(\boldsymbol{m}) + \boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m})\delta\boldsymbol{x}$

where $\boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m})$ is the Jacobian of $\boldsymbol{a}$ evaluated at $\boldsymbol{m}$ (for the exposition of the extended filter we take $\boldsymbol{a}$ to be vector valued hence the use of a bold font). We take $\delta\boldsymbol{x}$ to be normally distributed with mean of 0 and ignore any difficulties there may be with using Taylor with stochastic variables.

Applying this at $\boldsymbol{m} = \hat{\boldsymbol{x}}_{i-1}$ we have

$\displaystyle \boldsymbol{x}_i = \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1})(\boldsymbol{x}_{i-1} - \hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{\epsilon}_i$

Using the same reasoning as we did as for Kalman filters and writing $\boldsymbol{A}_{i-1}$ for $\boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1})$ we obtain

\displaystyle \begin{aligned} \hat{\boldsymbol{x}}^\flat_i &= \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) \\ \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1} \hat{\boldsymbol{\Sigma}}_{i-1} \boldsymbol{A}_{i-1}^\top + \boldsymbol{\Sigma}^{(x)}_{i-1} \end{aligned}

Note that we pass in the Jacobian of the update function as a function itself in the case of the extended Kalman filter rather than the matrix representing the linear function as we do in the case of the classical Kalman filter.

> k, p0 :: Floating a => a
> k = 1.0
> p0 = 0.1 * k
> r, deltaT :: Floating a => a
> r = 10.0
> deltaT = 0.0005

Relating ad and hmatrix is somewhat unpleasant but this can probably be ameliorated by defining a suitable datatype.

> a :: R 2 -> R 2
> a rpPrev = rNew # pNew
>   where
>     (r, pPrev) = headTail rpPrev
>     rNew :: R 1
>     rNew = konst r
>
>     (p,  _) = headTail pPrev
>     pNew :: R 1
>     pNew = fromList $[logit p k (r * deltaT)] > bigA :: R 2 -> Sq 2 > bigA rp = fromList$ concat $j [r, p] > where > (r, ps) = headTail rp > (p, _) = headTail ps > j = jacobian (\[r, p] -> [r, logit p k (r * deltaT)]) For some reason, hmatrix with static guarantees does not yet provide an inverse function for matrices. > inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n > inv m = fromJust$ linSolve m eye

Here is the extended Kalman filter itself. The type signatures on the expressions inside the function are not necessary but did help the implementor discover a bug in the mathematical derivation and will hopefully help the reader.

> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n ->
>           L m n -> Sq m ->
>           (R n -> R n) -> (R n -> Sq n) -> Sq n ->
>           [R m] ->
>           [(R n, Sq n)]
> outer muPrior sigmaPrior bigH bigSigmaY
>       littleA bigABuilder bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
>
>     update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>
>         v :: R m
>         v = y - (bigH #> xHatFlat)
>
>         bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>
>         bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
>
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>
>         bigA :: Sq n
>         bigA = bigABuilder xHat
>
>         xHatFlatNew :: R n
>         xHatFlatNew = littleA xHat
>
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

Now let us create some sample data.

> obsVariance :: Double
> obsVariance = 1e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [obsVariance]
> nObs :: Int
> nObs = 300
> singleSample :: Double -> RVarT (W.Writer [Double]) Double
> singleSample p0 = do
>   epsilon <- rvarT (Normal 0.0 obsVariance)
>   let p1 = logit p0 k (r * deltaT)
>   lift $W.tell [p1 + epsilon] > return p1 > streamSample :: RVarT (W.Writer [Double]) Double > streamSample = iterateM_ singleSample p0 > samples :: [Double] > samples = take nObs$ snd $> W.runWriter (evalStateT (sample streamSample) (pureMT 3)) We created our data with a growth rate of ghci> r 10.0 but let us pretend that we have read the literature on growth rates of bee colonies and we have some big doubts about growth rates but we are almost certain about the size of the colony at $t=0$. > muPrior :: R 2 > muPrior = fromList [5.0, 0.1] > > sigmaPrior :: Sq 2 > sigmaPrior = fromList [ 1e2, 0.0 > , 0.0, 1e-10 > ] We only observe the population and not the rate itself. > bigH :: L 1 2 > bigH = fromList [0.0, 1.0] Strictly speaking this should be 0 but this is close enough. > bigSigmaX :: Sq 2 > bigSigmaX = fromList [ 1e-10, 0.0 > , 0.0, 1e-10 > ] Now we can run our filter and watch it switch away from our prior belief as it accumulates more and more evidence. > test :: [(R 2, Sq 2)] > test = outer muPrior sigmaPrior bigH bigSigmaY > a bigA bigSigmaX (map (fromList . return) samples) # Importance Sampling # Importance Sampling Suppose we have an random variable $X$ with pdf $1/2\exp{-\lvert x\rvert}$ and we wish to find its second moment numerically. However, the random-fu package does not support sampling from such as distribution. We notice that $\displaystyle \int_{-\infty}^\infty x^2 \frac{1}{2} \exp{-\lvert x\rvert} \mathrm{d}x = \int_{-\infty}^\infty x^2 \frac{\frac{1}{2} \exp{-\lvert x\rvert}} {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}} \frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}} \,\mathrm{d}x$ So we can sample from ${\cal{N}}(0, 4)$ and evaluate $\displaystyle x^2 \frac{\frac{1}{2} \exp{-\lvert x\rvert}} {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}}$ > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-} > module Importance where > import Control.Monad > import Data.Random.Source.PureMT > import Data.Random > import Data.Random.Distribution.Binomial > import Data.Random.Distribution.Beta > import Control.Monad.State > import qualified Control.Monad.Writer as W > sampleImportance :: RVarT (W.Writer [Double]) () > sampleImportance = do > x <- rvarT$ Normal 0.0 2.0
>   let x2 = x^2
>       u = x2 * 0.5 * exp (-(abs x))
>       v = (exp ((-x2)/8)) * (recip (sqrt (8*pi)))
>       w = u / v
>   lift $W.tell [w] > return () > runImportance :: Int -> [Double] > runImportance n = > snd$
>   W.runWriter $> evalStateT (sample (replicateM n sampleImportance)) > (pureMT 2) We can run this 10,000 times to get an estimate. ghci> import Formatting ghci> format (fixed 2) (sum (runImportance 10000) / 10000) "2.03" Since we know that the $n$-th moment of the exponential distribution is $n! / \lambda^n$ where $\lambda$ is the rate (1 in this example), the exact answer is 2 which is not too far from our estimate using importance sampling. The value of $\displaystyle w(x) = \frac{1}{N}\frac{\frac{1}{2} \exp{-\lvert x\rvert}} {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}} = \frac{p(x)}{\pi(x)}$ is called the weight, $p$ is the pdf from which we wish to sample and $\pi$ is the pdf of the importance distribution. # Importance Sampling Approximation of the Posterior Suppose that the posterior distribution of a model in which we are interested has a complicated functional form and that we therefore wish to approximate it in some way. First assume that we wish to calculate the expectation of some arbitrary function $f$ of the parameters. $\displaystyle {\mathbb{E}}(f({x}) \,\vert\, y_1, \ldots y_T) = \int_\Omega f({x}) p({x} \, \vert \, y_1, \ldots y_T) \,\mathrm{d}{x}$ Using Bayes $\displaystyle \int_\Omega f({x}) {p\left(x \,\vert\, y_1, \ldots y_T\right)} \,\mathrm{d}{x} = \frac{1}{Z}\int_\Omega f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x) \,\mathrm{d}{x}$ where $Z$ is some normalizing constant. As before we can re-write this using a proposal distribution $\pi(x)$ $\displaystyle \frac{1}{Z}\int_\Omega f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x) \,\mathrm{d}{x} = \frac{1}{Z}\int_\Omega \frac{f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x)}{\pi(x)}\pi(x) \,\mathrm{d}{x}$ We can now sample $X^{(i)} \sim \pi({x})$ repeatedly to obtain $\displaystyle {\mathbb{E}}(f({x}) \,\vert\, y_1, \ldots y_T) \approx \frac{1}{ZN}\sum_1^N f({X^{(i)}}) \frac{p(y_1, \ldots y_T \, \vert \, {X^{(i)}})p({X^{(i)}})} {\pi({X^{(i)}})} = \sum_1^N w_if({X^{(i)}})$ where the weights $w_i$ are defined as before by $\displaystyle w_i = \frac{1}{ZN} \frac{p(y_1, \ldots y_T \, \vert \, {X^{(i)}})p({X^{(i)}})} {\pi({X^{(i)}})}$ We follow Alex Cook and use the example from (Rerks-Ngarm et al. 2009). We take the prior as $\sim {\cal{Be}}(1,1)$ and use ${\cal{U}}(0.0,1.0)$ as the proposal distribution. In this case the proposal and the prior are identical just expressed differently and therefore cancel. Note that we use the log of the pdf in our calculations otherwise we suffer from (silent) underflow, e.g., ghci> pdf (Binomial nv (0.4 :: Double)) xv 0.0 On the other hand if we use the log pdf form ghci> logPdf (Binomial nv (0.4 :: Double)) xv -3900.8941170876574 > xv, nv :: Int > xv = 51 > nv = 8197 > sampleUniform :: RVarT (W.Writer [Double]) () > sampleUniform = do > x <- rvarT StdUniform > lift$ W.tell [x]
>   return ()
> runSampler :: RVarT (W.Writer [Double]) () ->
>               Int -> Int -> [Double]
> runSampler sampler seed n =
>   snd $> W.runWriter$
>   evalStateT (sample (replicateM n sampler))
>              (pureMT (fromIntegral seed))
> sampleSize :: Int
> sampleSize = 1000
> pv :: [Double]
> pv = runSampler sampleUniform 2 sampleSize
> logWeightsRaw :: [Double]
> logWeightsRaw = map (\p -> logPdf (Beta 1.0 1.0) p +
>                            logPdf (Binomial nv p) xv -
>                            logPdf StdUniform p) pv
> logWeightsMax :: Double
> logWeightsMax = maximum logWeightsRaw
>
> weightsRaw :: [Double]
> weightsRaw = map (\w -> exp (w - logWeightsMax)) logWeightsRaw
> weightsSum :: Double
> weightsSum = sum weightsRaw
> weights :: [Double]
> weights = map (/ weightsSum) weightsRaw
> meanPv :: Double
> meanPv = sum $zipWith (*) pv weights > > meanPv2 :: Double > meanPv2 = sum$ zipWith (\p w -> p * p * w) pv weights
>
> varPv :: Double
> varPv = meanPv2 - meanPv * meanPv

ghci> meanPv
6.400869727227364e-3

But if we look at the size of the weights and the effective sample size

ghci> length $filter (>= 1e-6) weights 9 ghci> (sum weights)^2 / (sum$ map (^2) weights)
4.581078458313967

so we may not be getting a very good estimate. Let’s try

> sampleNormal :: RVarT (W.Writer [Double]) ()
> sampleNormal = do
>   x <- rvarT $Normal meanPv (sqrt varPv) > lift$ W.tell [x]
>   return ()
> pvC :: [Double]
> pvC = runSampler sampleNormal 3 sampleSize
> logWeightsRawC :: [Double]
> logWeightsRawC = map (\p -> logPdf (Beta 1.0 1.0) p +
>                             logPdf (Binomial nv p) xv -
>                             logPdf (Normal meanPv (sqrt varPv)) p) pvC
> logWeightsMaxC :: Double
> logWeightsMaxC = maximum logWeightsRawC
>
> weightsRawC :: [Double]
> weightsRawC = map (\w -> exp (w - logWeightsMaxC)) logWeightsRawC
> weightsSumC :: Double
> weightsSumC = sum weightsRawC
> weightsC :: [Double]
> weightsC = map (/ weightsSumC) weightsRawC
> meanPvC :: Double
> meanPvC = sum $zipWith (*) pvC weightsC > meanPvC2 :: Double > meanPvC2 = sum$ zipWith (\p w -> p * p * w) pvC weightsC
>
> varPvC :: Double
> varPvC = meanPvC2 - meanPvC * meanPvC

Now the weights and the effective size are more re-assuring

ghci> length $filter (>= 1e-6) weightsC 1000 ghci> (sum weightsC)^2 / (sum$ map (^2) weightsC)
967.113872888872

And we can take more confidence in the estimate

ghci> meanPvC
6.371225269833208e-3

# Bibliography

Rerks-Ngarm, Supachai, Punnee Pitisuttithum, Sorachai Nitayaphan, Jaranit Kaewkungwal, Joseph Chiu, Robert Paris, Nakorn Premsri, et al. 2009. “Vaccination with ALVAC and AIDSVAX to Prevent HIV-1 Infection in Thailand.” New England Journal of Medicine 361 (23) (December 3): 2209–2220. doi:10.1056/nejmoa0908492. http://dx.doi.org/10.1056/nejmoa0908492.

# Introduction

Suppose we have particle moving in at constant velocity in 1 dimension, where the velocity is sampled from a distribution. We can observe the position of the particle at fixed intervals and we wish to estimate its initial velocity. For generality, let us assume that the positions and the velocities can be perturbed at each interval and that our measurements are noisy.

A point of Haskell interest: using type level literals caught a bug in the mathematical description (one of the dimensions of a matrix was incorrect). Of course, this would have become apparent at run-time but proof checking of this nature is surely the future for mathematicians. One could conceive of writing an implementation of an algorithm or proof, compiling it but never actually running it purely to check that some aspects of the algorithm or proof are correct.

# The Mathematical Model

We take the position as $x_i$ and the velocity $v_i$:

\displaystyle \begin{aligned} x_i &= x_{i-1} + \Delta T v_{i-1} + \psi^{(x)}_i \\ v_i &= v_{i-1} + \psi^{(v)}_i \\ y_i &= a_i x_i + \upsilon_i \end{aligned}

where $\psi^{(x)}_i, \psi^{(v)}_i$ and $\upsilon_i$ are all IID normal with means of 0 and variances of $\sigma^2_x, \sigma^2_v$ and $\sigma^2_y$

We can re-write this as

\displaystyle \begin{aligned} \boldsymbol{x}_i &= \boldsymbol{A}_{i-1}\boldsymbol{x}_{i-1} + \boldsymbol{\psi}_{i-1} \\ \boldsymbol{y}_i &= \boldsymbol{H}_i\boldsymbol{x}_i + \boldsymbol{\upsilon}_i \end{aligned}

where

$\displaystyle \boldsymbol{A}_i = \begin{bmatrix} 1 & \Delta T\\ 0 & 1\\ \end{bmatrix} ,\quad \boldsymbol{H}_i = \begin{bmatrix} a_i & 0 \\ \end{bmatrix} ,\quad \boldsymbol{\psi}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(x)}_i\big) ,\quad \boldsymbol{\Sigma}^{(x)}_i = \begin{bmatrix} \sigma^2_{x} & 0\\ 0 & \sigma^2_{v} \\ \end{bmatrix} ,\quad \boldsymbol{\upsilon}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(y)}_i\big) ,\quad \boldsymbol{\Sigma}^{(y)}_i = \begin{bmatrix} \sigma^2_{z} \\ \end{bmatrix}$

Let us denote the mean and variance of $\boldsymbol{X}_i\,\vert\,\boldsymbol{Y}_{i-1}$ as $\hat{\boldsymbol{x}}^\flat_i$ and $\hat{\boldsymbol{\Sigma}}^\flat_i$ respectively and note that

\displaystyle \begin{aligned} {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} = {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} = {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i} \end{aligned}

Since ${\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}}$ and ${\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}}$ are jointly Gaussian and recalling that $({\hat{\boldsymbol{\Sigma}}^\flat_i})^\top$ = $\hat{\boldsymbol{\Sigma}}^\flat_i$ as covariance matrices are symmetric, we can calculate their mean and covariance matrix as

$\displaystyle \begin{bmatrix} \hat{\boldsymbol{x}}^\flat_i \\ \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i \end{bmatrix} ,\quad \begin{bmatrix} \hat{\boldsymbol{\Sigma}}^\flat_i & \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top \\ \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i & \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\ \end{bmatrix}$

We can now use standard formulæ which say if

$\displaystyle \begin{bmatrix} \boldsymbol{X} \\ \boldsymbol{Y} \end{bmatrix} \sim {\cal{N}} \begin{bmatrix} \begin{bmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{bmatrix} & , & \begin{bmatrix} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}^\top_{xy} & \boldsymbol{\Sigma}_y \end{bmatrix} \end{bmatrix}$

then

$\displaystyle \boldsymbol{X}\,\vert\,\boldsymbol{Y}=\boldsymbol{y} \sim {{\cal{N}}\big( \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y(\boldsymbol{y} - \boldsymbol{\mu}_y) , \boldsymbol{\Sigma}_x - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y\boldsymbol{\Sigma}^\top_{xy}\big)}$

and apply this to

$\displaystyle (\boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i-1})\,\vert\,(\boldsymbol{Y}_i\,\vert\, \boldsymbol{Y}_{i-1})$

to give

$\displaystyle \boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i} = \boldsymbol{y}_i \sim {{\cal{N}}\big( \hat{\boldsymbol{x}}^\flat_i + \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1} (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) , \hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i\big)}$

This is called the measurement update; more explicitly

\displaystyle \begin{aligned} \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1} (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) \\ \hat{\boldsymbol{\Sigma}}_i &\triangleq {\hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i} \end{aligned}

Sometimes the measurement residual $\boldsymbol{v}_i$, the measurement prediction covariance $\boldsymbol{S}_i$ and the filter gain $\boldsymbol{K}_i$ are defined and the measurement update is written as

\displaystyle \begin{aligned} \boldsymbol{v}_i & \triangleq \boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i \\ \boldsymbol{S}_i & \triangleq \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\ \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top\boldsymbol{S}^{-1}_i \\ \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\ \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i \end{aligned}

We further have that

\displaystyle \begin{aligned} {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} = {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_{i-1}}\,\vert\,{\boldsymbol{Y}_{i-1}} = {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_i} \end{aligned}

We thus obtain the Kalman filter prediction step:

\displaystyle \begin{aligned} \hat{\boldsymbol{x}}^\flat_i &= \boldsymbol{A}_{i-1}\hat{\boldsymbol{x}}_{i-1} \\ \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1} \hat{\boldsymbol{\Sigma}}_{i-1} \boldsymbol{A}_{i-1}^\top + \boldsymbol{\Sigma}^{(x)}_{i-1} \end{aligned}

Further information can be found in (Boyd 2008), (Kleeman 1996) and (Särkkä 2013).

The hmatrix now uses type level literals via the DataKind extension in ghc to enforce compatibility of matrix and vector operations at the type level. See here for more details. Sadly a bug in the hmatrix implementation means we can’t currently use this excellent feature and we content ourselves with comments describing what the types would be were it possible to use it.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> module FunWithKalmanPart1a where
> import Numeric.LinearAlgebra.HMatrix hiding ( outer )
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import qualified Control.Monad.Writer as W

Let us make our model almost deterministic but with noisy observations.

> stateVariance :: Double
> stateVariance = 1e-6
> obsVariance :: Double
> obsVariance = 1.0

And let us start with a prior normal distribution with a mean position and velocity of 0 with moderate variances and no correlation.

> -- muPrior :: R 2
> muPrior :: Vector Double
> muPrior = vector [0.0, 0.0]
> -- sigmaPrior :: Sq 2
> sigmaPrior :: Matrix Double
> sigmaPrior = (2 >< 2) [ 1e1,   0.0
>                       , 0.0,   1e1
>                       ]

We now set up the parameters for our model as outlined in the preceeding section.

> deltaT :: Double
> deltaT = 0.001
> -- bigA :: Sq 2
> bigA :: Matrix Double
> bigA = (2 >< 2) [ 1, deltaT
>                 , 0,      1
>                 ]
> a :: Double
> a = 1.0
> -- bigH :: L 1 2
> bigH :: Matrix Double
> bigH = (1 >< 2) [ a, 0
>                 ]
> -- bigSigmaY :: Sq 1
> bigSigmaY :: Matrix Double
> bigSigmaY = (1 >< 1) [ obsVariance ]
> -- bigSigmaX :: Sq 2
> bigSigmaX :: Matrix Double
> bigSigmaX = (2 >< 2) [ stateVariance, 0.0
>                      , 0.0,           stateVariance
>                      ]

The implementation of the Kalman filter using the hmatrix package is straightforward.

> -- outer ::  forall m n . (KnownNat m, KnownNat n) =>
> --           R n -> Sq n -> L m n -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer :: Vector Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> [Vector Double]
>          -> [(Vector Double, Matrix Double)]
> outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
>
>     -- update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         -- v :: R m
>         v = y - bigH #> xHatFlat
>         -- bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>         -- bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>         -- xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         -- bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>         -- xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         -- bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

We create some ranodm data using our model parameters.

> singleSample ::(Double, Double) ->
>                RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> singleSample (xPrev, vPrev) = do
>   psiX <- rvarT (Normal 0.0 stateVariance)
>   let xNew = xPrev + deltaT * vPrev + psiX
>   psiV <- rvarT (Normal 0.0 stateVariance)
>   let vNew = vPrev + psiV
>   upsilon <- rvarT (Normal 0.0 obsVariance)
>   let y = a * xNew + upsilon
>   lift $W.tell [(y, (xNew, vNew))] > return (xNew, vNew) > streamSample :: RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double) > streamSample = iterateM_ singleSample (1.0, 1.0) > samples :: ((Double, Double), [(Double, (Double, Double))]) > samples = W.runWriter (evalStateT (sample streamSample) (pureMT 2)) Here are the actual values of the randomly generated positions. > actualXs :: [Double] > actualXs = map (fst . snd)$ take nObs $snd samples > test :: [(Vector Double, Matrix Double)] > test = outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX > (map (\x -> vector [x])$ map fst $snd samples) And using the Kalman filter we can estimate the positions. > estXs :: [Double] > estXs = map (!!0)$ map toList $map fst$ take nObs test
> nObs :: Int
> nObs = 1000

And we can see that the estimates track the actual positions quite nicely.

Of course we really wanted to estimate the velocity.

> actualVs :: [Double]
> actualVs = map (snd . snd) $take nObs$ snd samples
> estVs :: [Double]
> estVs = map (!!1) $map toList$ map fst take nObs test # Bibliography Boyd, Stephen. 2008. “EE363 Linear Dynamical Systems.” http://stanford.edu/class/ee363. Kleeman, Lindsay. 1996. “Understanding and Applying Kalman Filtering.” In Proceedings of the Second Workshop on Perceptive Systems, Curtin University of Technology, Perth Western Australia (25-26 January 1996). Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. Vol. 3. Cambridge University Press. # Fun with (Kalman) Filters Part I Suppose we wish to estimate the mean of a sample drawn from a normal distribution. In the Bayesian approach, we know the prior distribution for the mean (it could be a non-informative prior) and then we update this with our observations to create the posterior, the latter giving us improved information about the distribution of the mean. In symbols $\displaystyle p(\theta \,\vert\, x) \propto p(x \,\vert\, \theta)p(\theta)$ Typically, the samples are chosen to be independent, and all of the data is used to perform the update but, given independence, there is no particular reason to do that, updates can performed one at a time and the result is the same; nor is the order of update important. Being a bit imprecise, we have $\displaystyle p(z \,\vert\, x, y) = p(z, x, y)p(x, y) = p(z, x, y)p(x)p(y) = p((z \,\vert\, x) \,\vert\, y) = p((z \,\vert\, y) \,\vert\, x)$ The standard notation in Bayesian statistics is to denote the parameters of interest as $\theta \in \mathbb{R}^p$ and the observations as $x \in \mathbb{R}^n$. For reasons that will become apparent in later blog posts, let us change notation and label the parameters as $x$ and the observations as $y$. Let us take a very simple example of a prior $X \sim {\cal{N}}(0, \sigma^2)$ where $\sigma^2$ is known and then sample from a normal distribution with mean $x$ and variance for the $i$-th sample $c_i^2$ where $c_i$ is known (normally we would not know the variance but adding this generality would only clutter the exposition unnecessarily). $\displaystyle p(y_i \,\vert\, x) = \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)$ The likelihood is then $\displaystyle p(\boldsymbol{y} \,\vert\, x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)$ As we have already noted, instead of using this with the prior to calculate the posterior, we can update the prior with each observation separately. Suppose that we have obtained the posterior given $i - 1$ samples (we do not know this is normally distributed yet but we soon will): $\displaystyle p(x \,\vert\, y_1,\ldots,y_{i-1}) = {\cal{N}}(\hat{x}_{i-1}, \hat{\sigma}^2_{i-1})$ Then we have \displaystyle \begin{aligned} p(x \,\vert\, y_1,\ldots,y_{i}) &\propto p(y_i \,\vert\, x)p(x \,\vert\, y_1,\ldots,y_{i-1}) \\ &\propto \exp-\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg) \exp-\bigg(\frac{(x - \hat{x}_{i-1})^2}{2\hat{\sigma}_{i-1}^2}\bigg) \\ &\propto \exp-\Bigg(\frac{x^2}{c_i^2} - \frac{2xy_i}{c_i^2} + \frac{x^2}{\hat{\sigma}_{i-1}^2} - \frac{2x\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg) \\ &\propto \exp-\Bigg( x^2\Bigg(\frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}\Bigg) - 2x\Bigg(\frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg)\Bigg) \end{aligned} Writing $\displaystyle \frac{1}{\hat{\sigma}_{i}^2} \triangleq \frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}$ and then completing the square we also obtain $\displaystyle \frac{\hat{x}_{i}}{\hat{\sigma}_{i}^2} \triangleq \frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}$ # More Formally Now let’s be a bit more formal about conditional probability and use the notation of $\sigma$-algebras to define ${\cal{F}}_i = \sigma\{Y_1,\ldots, Y_i\}$ and $M_i \triangleq \mathbb{E}(X \,\vert\, {\cal{F}}_i)$ where $Y_i = X + \epsilon_i$, $X$ is as before and $\epsilon_i \sim {\cal{N}}(0, c_k^2)$. We have previously calculated that $M_i = \hat{x}_i$ and that ${\cal{E}}((X - M_i)^2 \,\vert\, Y_1, \ldots Y_i) = \hat{\sigma}_{i}^2$ and the tower law for conditional probabilities then allows us to conclude ${\cal{E}}((X - M_i)^2) = \hat{\sigma}_{i}^2$. By Jensen’s inequality, we have $\displaystyle {\cal{E}}(M_i^2) = {\cal{E}}({\cal{E}}(X \,\vert\, {\cal{F}}_i)^2)) \leq {\cal{E}}({\cal{E}}(X^2 \,\vert\, {\cal{F}}_i))) = {\cal{E}}(X^2) = \sigma^2$ Hence $M$ is bounded in $L^2$ and therefore converges in $L^2$ and almost surely to $M_\infty \triangleq {\cal{E}}(X \,\vert\, {\cal{F}}_\infty)$. The noteworthy point is that if $M_\infty = X$ if and only if $\hat{\sigma}_i$ converges to 0. Explicitly we have $\displaystyle \frac{1}{\hat{\sigma}_i^2} = \frac{1}{\sigma^2} + \sum_{k=1}^i\frac{1}{c_k^2}$ which explains why we took the observations to have varying and known variances. You can read more in Williams’ book (Williams 1991). # A Quick Check We have reformulated our estimation problem as a very simple version of the celebrated Kalman filter. Of course, there are much more interesting applications of this but for now let us try “tracking” the sample from the random variable. > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-} > module FunWithKalmanPart1 ( > obs > , nObs > , estimates > , uppers > , lowers > ) where > > import Data.Random.Source.PureMT > import Data.Random > import Control.Monad.State > var, cSquared :: Double > var = 1.0 > cSquared = 1.0 > > nObs :: Int > nObs = 100 > createObs :: RVar (Double, [Double]) > createObs = do > x <- rvar (Normal 0.0 var) > ys <- replicateM nObs rvar (Normal x cSquared)
>   return (x, ys)
>
> obs :: (Double, [Double])
> obs = evalState (sample createObs) (pureMT 2)
>
> updateEstimate :: (Double, Double) -> (Double, Double) -> (Double, Double)
> updateEstimate (xHatPrev, varPrev) (y, cSquared) = (xHatNew, varNew)
>   where
>     varNew  = recip (recip varPrev + recip cSquared)
>     xHatNew = varNew * (y / cSquared + xHatPrev / varPrev)
>
> estimates :: [(Double, Double)]
> estimates = scanl updateEstimate (y, cSquared) (zip ys (repeat cSquared))
>   where
>     y  = head $snd obs > ys = tail$ snd obs
>
> uppers :: [Double]
> uppers = map (\(x, y) -> x + 3 * (sqrt y)) estimates
>
> lowers :: [Double]
> lowers = map (\(x, y) -> x - 3 * (sqrt y)) estimates

# Bibliography

Williams, David. 1991. Probability with Martingales. Cambridge University Press.

This is really intended as a draft chapter for our book. Given the diverse natures of the intended intended audiences, it is probably a bit light on explanation of the Haskell (use of monad transformers) for those with a background in numerical methods. It is hoped that the explanation of the mathematics is adequate for those with a background in Haskell but not necessarily in numerical methods. As always, any feedback is gratefully accepted.

# Introduction

Imagine an insect, a grasshopper, trapped on the face of a clock which wants to visit each hour an equal number of times. However, there is a snag: it can only see the value of the hour it is on and the value of the hours immediately anti-clockwise and immediately clockwise. For example, if it is standing on 5 then it can see the 5, the 4, and the 6 but no others.

It can adopt the following strategy: toss a fair coin and move anti-clockwise for a head and move clockwise for a tail. Intuition tells us that over a large set of moves the grasshopper will visit each hour (approximately) the same number of times.

Can we confirm our intuition somehow? Suppose that the strategy has worked and the grasshopper is now to be found with equal probability on any hour. Then at the last jump, the grasshopper must either have been at the hour before the one it is now on or it must have been at the hour after the one it is now on. Let us denote the probability that the grasshopper is on hour $n$ by $\pi(n)$ and the (conditional) probability that the grasshopper jumps to state $n$ given it was in state $m$ by $p(n \, |\, m)$. Then we have

$\displaystyle \pi'(n) = p(n \, |\, n - 1)\pi(n - 1) + p(n \, |\, n + 1)\pi(n + 1)$

Substituting in where $N$ is a normalising constant (12 in this case) we obtain

$\displaystyle \pi'(n) = \frac{1}{2}\frac{1}{N} + \frac{1}{2}\frac{1}{N} = \frac{1}{N}$

This tells us that the required distribution is a fixed point of the grasshopper’s strategy. But does the strategy actually converge to the fixed point? Let us perform an experiment.

First we import some modules from hmatrix.

> {-# LANGUAGE FlexibleContexts #-}
> module Chapter1 where
> import Data.Packed.Matrix
> import Numeric.LinearAlgebra.Algorithms
> import Numeric.Container
> import Data.Random
> import qualified Control.Monad.Writer as W
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT

Let us use a clock with 5 hours to make the matrices sufficiently small to fit on one page.

Here is the strategy encoded as a matrix. For example the first row says jump to position 1 with probablity 0.5 or jump to position 5 with probability 0.5.

> eqProbsMat :: Matrix Double
> eqProbsMat = (5 >< 5)
>         [ 0.0, 0.5, 0.0, 0.0, 0.5
>         , 0.5, 0.0, 0.5, 0.0, 0.0
>         , 0.0, 0.5, 0.0, 0.5, 0.0
>         , 0.0, 0.0, 0.5, 0.0, 0.5
>         , 0.5, 0.0, 0.0, 0.5, 0.0
>         ]

We suppose the grasshopper starts at 1 o’clock.

> startOnOne :: Matrix Double
> startOnOne = ((1 >< 5) [1.0, 0.0, 0.0, 0.0, 0.0])

If we allow the grasshopper to hop 1000 times then we see that it is equally likely to be found on any hour hand with a 20% probability.

ghci> eqProbsMat
(5><5)
[ 0.0, 0.5, 0.0, 0.0, 0.5
, 0.5, 0.0, 0.5, 0.0, 0.0
, 0.0, 0.5, 0.0, 0.5, 0.0
, 0.0, 0.0, 0.5, 0.0, 0.5
, 0.5, 0.0, 0.0, 0.5, 0.0 ]

ghci> take 1 $drop 1000$ iterate (<> eqProbsMat) startOnOne
[(1><5)
[ 0.20000000000000007, 0.2, 0.20000000000000004, 0.20000000000000004, 0.2 ]]

In this particular case, the strategy does indeed converge.

Now suppose the grasshopper wants to visit each hour in proportion the value of the number on the hour. Lacking pen and paper (and indeed opposable thumbs), it decides to adopt the following strategy: toss a fair coin as in the previous strategy but only move if the number is larger than the one it is standing on; if, on the other hand, the number is smaller then choose a number at random from between 0 and 1 and move if this value is smaller than the ratio of the proposed hour and the hour on which it is standing otherwise stay put. For example, if the grasshopper is standing on 5 and gets a tail then it will move to 6 but if it gets a head then four fifths of the time it will move to 4 but one fifth of the time it will stay where it is.

Suppose that the strategy has worked (it is not clear that is has) and the grasshopper is now to be found at 12 o’clock 12 times as often as at 1 o’clock, at 11 o’clock 11 times as often as at 1 o’clock, etc. Then at the last jump, the grasshopper must either have been at the hour before the one it is now on, the hour after the one it is now on or the same hour it is now on. Let us denote the probability that the grasshopper is on hour $n$ by $\pi(n)$.

$\displaystyle \pi'(n) = p(n \, |\, n - 1)\pi(n - 1) + p(n \, |\, n)\pi(n) + p(n \, |\, n + 1)\pi(n + 1)$

Substituting in at 4 say

\displaystyle \begin{aligned} \pi'(4) &= \frac{1}{2}\pi(3) + \frac{1}{2}\frac{1}{4}\pi(4) + \frac{1}{2}\frac{4}{5}\pi(5) \\ &= \frac{1}{2}\bigg(\frac{3}{N} + \frac{1}{4}\frac{4}{N} + \frac{4}{5}\frac{5}{N}\bigg) \\ &= \frac{1}{N}\frac{8}{2} \\ &= \frac{4}{N} \\ &= \pi(4) \end{aligned}

The reader can check that this relationship holds for all other hours. This tells us that the required distribution is a fixed point of the grasshopper’s strategy. But does this strategy actually converge to the fixed point?

Again, let us use a clock with 5 hours to make the matrices sufficiently small to fit on one page.

Here is the strategy encoded as a matrix. For example the first row says jump to position 1 with probablity 0.5 or jump to position 5 with probability 0.5.

> incProbsMat :: Matrix Double
> incProbsMat = scale 0.5 $> (5 >< 5) > [ 0.0, 1.0, 0.0, 0.0, 1.0 > , 1.0/2.0, 1.0/2.0, 1.0, 0.0, 0.0 > , 0.0, 2.0/3.0, 1.0/3.0, 1.0, 0.0 > , 0.0, 0.0, 3.0/4.0, 1.0/4.0, 1.0 > , 1.0/5.0, 0.0, 0.0, 4.0/5.0, 1.0/5.0 + 4.0/5.0 > ] We suppose the grasshopper starts at 1 o’clock. If we allow the grasshopper to hop 1000 times then we see that it is equally likely to be found on any hour hand $n$ with a probability of $n$ times the probability of being found on 1. ghci> incProbsMat (5><5) [ 0.0, 0.5, 0.0, 0.0, 0.5 , 0.25, 0.25, 0.5, 0.0, 0.0 , 0.0, 0.3333333333333333, 0.16666666666666666, 0.5, 0.0 , 0.0, 0.0, 0.375, 0.125, 0.5 , 0.1, 0.0, 0.0, 0.4, 0.5 ] ghci> take 1$ drop 1000  iterate (<> incProbsMat) startOnOne [(1><5) [ 6.666666666666665e-2, 0.1333333333333333, 0.19999999999999996, 0.2666666666666666, 0.33333333333333326 ]] In this particular case, the strategy does indeed converge. Surprisingly, this strategy produces the desired result and is known as the Metropolis Algorithm. What the grasshopper has done is to construct a (discrete) Markov Process which has a limiting distribution (the stationary distribution) with the desired feature: sampling from this process will result in each hour being sampled in proportion to its value. # Markov Chain Theory Let us examine what is happening in a bit more detail. The grasshopper has started with a very simple Markov Chain: one which jumps clockwise or anti-clockwise with equal probability and then modified it. But what is a Markov Chain? A time homogeneous Markov chain is a countable sequence of random variables $X_0, X_1, \ldots$ such that $\displaystyle \mathbb{P} (X_{n+1} = j \,|\, X_0 = i_0, X_1 = i_1, \dots X_n = i) = \mathbb{P} (X_{n+1} = j \,|\, X_n = i)$ We sometimes say that a Markov Chain is discrete time stochastic process with the above property. So the very simple Markov Chain can be described by $\displaystyle q(i, j) = \begin{cases} \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = \frac{1}{2} & \text{if } j = i + 1 \mod N \\ \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = \frac{1}{2} & \text{if } j = i - 1 \mod N \\ \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = 0 & \text{otherwise } \end{cases}$ The grasshopper knows that $\pi(i) = i/N$ so it can calculate $\pi(j)/\pi(i) = j/i$ without knowing $N$. This is important because now, without knowing $N$, the grasshopper can evaluate $\displaystyle p(i, j) = \begin{cases} q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\ 1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(k) q(k,i)}{\pi(i) q(i,k)} \land 1 \bigg] & \text{if } j = i \end{cases}$ where $\land$ takes the maximum of its arguments. Simplifying the above by substituing in the grasshopper’s probabilities and noting that $j = i \pm 1 \mod N$ is somewhat obscure way of saying jump clockwise or anti-clockwise we obtain $\displaystyle q(i, j) = \begin{cases} \frac{1}{2} (\frac{j}{i} \land 1) & \text{if } j \text{ is 1 step clockwise} \\ \frac{1}{2} (\frac{j}{i} \land 1) & \text{if } j \text{ is 1 step anti-clockwise} \\ 1 - \frac{1}{2}(\frac{j^c}{i} \land 1) - \frac{1}{2}(\frac{j^a}{i} \land 1) & \text{if } j = i \text{ and } j^c \text{ is one step clockwise and } j^a \text{ is one step anti-clockwise} \\ 0 & \text{otherwise} \end{cases}$ ## The Ergodic Theorem In most studies of Markov chains, one is interested in whether a chain has a stationary distribution. What we wish to do is take a distribution and create a chain with this distribution as its stationary distribution. We will still need to show that our chain does indeed have the correct stationary distribution and we state the relevant theorem somewhat informally and with no proof. ### Theorem An irreducible, aperiodic and positive recurrent Markov chain has a unique stationary distribution. Roughly speaking • Irreducible means it is possible to get from any state to any other state. • Aperiodic means that returning to a state having started at that state occurs at irregular times. • Positive recurrent means that the first time to hit a state is finite (for every state and more pedantically except on sets of null measure). Note that the last condition is required when the state space is infinite – see Skrikant‘s lecture notes for an example and also for a more formal definition of the theorem and its proof. ### Algorithm Let $\pi$ be a probability distribution on the state space $\Omega$ with $\pi(i) > 0$ for all $i$ and let $(Q, \pi_0)$ be an ergodic Markov chain on $\Omega$ with transition probabilities $q(i,j) > 0$ (the latter condition is slightly stronger than it need be but we will not need fully general conditions). Create a new (ergodic) Markov chain with transition probabilities $\displaystyle p_{ij} = \begin{cases} q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\ 1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j = i \end{cases}$ where $\land$ takes the maximum of its arguments. Calculate the value of interest on the state space e.g. the total magnetization for each step produced by this new chain. Repeat a sufficiently large number of times and take the average. This gives the estimate of the value of interest. ### Convergence Let us first note that the Markov chain produced by this algorithm almost trivially satisfies the detailed balance condition, for example, \displaystyle \begin{aligned} \pi(i) q(i,j)\bigg[\frac{\pi(j) q(j, i)}{\pi(i)q(i,j)} \land 1\bigg] &= \pi(i)q(i,j) \land \pi(j)q(j,i) \\ &= \pi(j)q(j,i)\bigg[\frac{\pi(i) q(i, j)}{\pi(j)q(j,i)} \land 1\bigg] \end{aligned} Secondly since we have specified that $(Q, \pi_0)$ is ergodic then clearly $(P, \pi_0)$ is also ergodic (all the transition probabilities are $> 0$). So we know the algorithm will converge to the unique distribution we specified to provide estimates of values of interest. # Gibbs Sampling ## Random Scan For simplicity let us consider a model with two parameters and that we sample from either parameter with equal probability. In this sampler, We update the parameters in a single step. $\displaystyle \begin{cases} \text{Sample } \theta_1^{(i+1)} \sim \pi(\theta_1 \,\big|\, \theta_2^{(i)}) & \text{with probability } \frac{1}{2} \\ \text{Sample } \theta_2^{(i+1)} \sim \pi(\theta_2 \,\big|\, \theta_1^{(i)}) & \text{with probability } \frac{1}{2} \end{cases}$ The transition density kernel is then given by $\displaystyle q\big(\boldsymbol{\theta}^{(i+1)}, \boldsymbol{\theta}^{(i)}\big) = \frac{1}{2}\pi(\theta_1^{(i+1)} \,\big|\, \theta_2^{(i)})\delta({\theta_2^{(i)},\theta_2^{(i+1)}}) + \frac{1}{2}\pi(\theta_2^{(i+1)} \,\big|\, \theta_1^{(i)})\delta({\theta_1^{(i)},\theta_1^{(i+1)}})$ where $\delta$ is the Dirac delta function. ### Detailed balance This sampling scheme satisifies the detailed balance condition. We have \displaystyle \begin{aligned} \pi(\theta_1, \theta_2) \bigg[ \frac{1}{2}\pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) + \frac{1}{2}\pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\ \frac{1}{2}\bigg[\pi(\theta_1, \theta_2) \pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) + \pi(\theta_1, \theta_2) \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\ \frac{1}{2}\bigg[\pi(\theta_1, \theta_2') \pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) + \pi(\theta_1', \theta_2) \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\ \frac{1}{2}\bigg[ \pi(\theta_2')\pi(\theta_1 \,\big|\, \theta_2') \frac{1}{2}\pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) + \pi(\theta_1')\pi(\theta_2 \,\big|\, \theta_1') \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'}) \bigg] &= \\ \frac{1}{2}\bigg[ \pi(\theta_1', \theta_2')\pi(\theta_1 \,\big|\, \theta_2') \delta({\theta_2',\theta_2}) + \pi(\theta_1', \theta_2')\pi(\theta_2 \,\big|\, \theta_1') \delta({\theta_1',\theta_1}) \bigg] &= \\ \pi(\theta_1', \theta_2')\bigg[ \frac{1}{2}\pi(\theta_1 \,\big|\, \theta_2') \delta({\theta_2',\theta_2}) + \frac{1}{2}\pi(\theta_2 \,\big|\, \theta_1') \delta({\theta_1',\theta_1}) \bigg] & \end{aligned} In other words $\displaystyle \pi\big({\boldsymbol{\theta}}\big)q\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big) = \pi\big({\boldsymbol{\theta'}}\big)q\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big)$ Hand waving slightly, we can see that this scheme satisfies the premises of the ergodic theorem and so we can conclude that there is a unique stationary distribution and $\pi$ must be that distribution. # Systematic Scan Most references on Gibbs sampling do not describe the random scan but instead something called a systematic scan. Again for simplicity let us consider a model with two parameters. In this sampler, we update the parameters in two steps. \displaystyle \begin{aligned} \text{Sample } \theta_1^{(i+1)} & \sim & \pi(\theta_1 \,\big|\, \theta_2^{(i)}) \\ \text{Sample } \theta_2^{(i+1)} & \sim & \pi(\theta_2 \,\big|\, \theta_1^{(i+1)}) \end{aligned} We observe that this is not time-homegeneous; at each step the transition matrix flips between the two transition matrices given by the individual steps. Thus although, as we show below, each individual transtion satisifies the detailed balance condition, we cannot apply the ergodic theorem as it only applies to time-homogeneous processes. The transition density kernel is then given by $\displaystyle q\big(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(i+1)}\big) = q_1\big(\boldsymbol{\theta}^{(i)}, \tilde{\boldsymbol{\theta}}\big) q_2\big(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^{(i+1)}\big)$ where $\tilde{\boldsymbol{\theta}} = (\theta_1^{(i+1)}, \theta_2^{(i)})^\top$. Thus $\displaystyle q\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = \pi(\theta_1' \,\big|\, \theta_2) \pi(\theta_2' \,\big|\, \theta_1')$ ### Detailed balance Suppose that we have two states $\boldsymbol{\theta} = (\theta_1, \theta_2)^\top$ and $\boldsymbol{\theta}' = (\theta_1', \theta_2')^\top$ and that $\theta_2 \neq \theta_2'$. Then $q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = 0$. Trivially we have $\displaystyle \pi\big({\boldsymbol{\theta}}\big)q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = \pi\big({\boldsymbol{\theta'}}\big)q_1\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)$ Now suppose that $\theta_2 = \theta_2'$ \displaystyle \begin{aligned} \pi(\theta_1, \theta_2)q_1((\theta_1, \theta_2), (\theta_1', \theta_2)) & = \pi(\theta_1, \theta_2)\pi(\theta_1' \,\big|\, \theta_2) \\ & = \pi(\theta_1 \,\big|\, \theta_2)\pi(\theta_1', \theta_2) \\ & = \pi(\theta_1 \,\big|\, \theta_2')\pi(\theta_1', \theta_2') \\ & = \pi(\theta_1', \theta_2')q_1((\theta_1', \theta_2), (\theta_1, \theta_2)) \end{aligned} So again we have $\displaystyle \pi\big({\boldsymbol{\theta}}\big)q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = \pi\big({\boldsymbol{\theta'}}\big)q_1\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)$ Similarly we can show $\displaystyle \pi\big({\boldsymbol{\theta}}\big)q_2\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = \pi\big({\boldsymbol{\theta'}}\big)q_2\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)$ But note that \displaystyle \begin{aligned} \pi(\theta_1, \theta_2) q_1((\theta_1, \theta_2), (\theta_1', \theta_2)) q_2((\theta_1', \theta_2), (\theta_1', \theta_2')) & = \pi(\theta_1, \theta_2) \pi(\theta_1' \,\big|\, \theta_2) \pi(\theta_2' \,\big|\, \theta_1') \\ & = \pi(\theta_1', \theta_2) \pi(\theta_1 \,\big|\, \theta_2) \pi(\theta_2' \,\big|\, \theta_1') \\ & = \pi(\theta_1' \,\big|\, \theta_2) \pi(\theta_1 \,\big|\, \theta_2) \pi(\theta_2', \theta_1') \end{aligned} whereas \displaystyle \begin{aligned} \pi(\theta_1', \theta_2') q_1((\theta_1', \theta_2'), (\theta_1, \theta_2')) q_2((\theta_1, \theta_2'), (\theta_1, \theta_2)) & = \pi(\theta_1', \theta_2') \pi(\theta_1 \,\big|\, \theta_2') \pi(\theta_2 \,\big|\, \theta_1) \\ & = \pi(\theta_1, \theta_2') \pi(\theta_1' \,\big|\, \theta_2') \pi(\theta_2 \,\big|\, \theta_1) \\ & = \pi(\theta_2' \,\big|\, \theta_1) \pi(\theta_1' \,\big|\, \theta_2') \pi(\theta_2, \theta_1) \end{aligned} and these are not necessarily equal. So the detailed balance equation is not satisfied, another sign that we cannot appeal to the ergodic theorem. # An Example: The Bivariate Normal Let us demonstrate the Gibbs sampler with a distribution which we actually know: the bivariate normal. $\displaystyle \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \bigg| y \sim N \begin{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} & \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \end{bmatrix}$ The conditional distributions are easily calculated to be \displaystyle \begin{aligned} \theta_1 \,\vert\, \theta_2, y &\sim {\cal{N}}(y_1 + \rho(\theta_2 - y_2), 1 - \rho^2) \\ \theta_2 \,\vert\, \theta_1, y &\sim {\cal{N}}(y_2 + \rho(\theta_1 - y_1), 1 - \rho^2) \end{aligned} Let’s take a correlation of 0.8, a data point of (0.0, 0.0) and start the chain at (2.5, 2.5). > rho :: Double > rho = 0.8 > > y :: (Double, Double) > y = (0.0, 0.0) > > y1, y2 :: Double > y1 = fst y > y2 = snd y > > initTheta :: (Double, Double) > initTheta = (2.5, 2.5) We pre-calculate the variance needed for the sampler. > var :: Double > var = 1.0 - rho^2 In Haskell and in the random-fu package, sampling from probability distributions is implemented as a monad. We sample from the relevant normal distributions and keep the trajectory using a writer monad. > gibbsSampler :: Double -> RVarT (W.Writer [(Double,Double)]) Double > gibbsSampler oldTheta2 = do > newTheta1 <- rvarT (Normal (y1 + rho * (oldTheta2 - y2)) var) > lift W.tell [(newTheta1, oldTheta2)]
>   newTheta2 <- rvarT (Normal (y2 + rho * (newTheta1 - y1)) var)
>   lift $W.tell [(newTheta1, newTheta2)] > return$ newTheta2

It is common to allow the chain to “burn in” so as to “forget” its starting position. We arbitrarily burn in for 10,000 steps.

> burnIn :: Int
> burnIn = 10000

We sample repeatedly from the sampler using the monadic form of iterate. Running the monadic stack is slightly noisy but nonetheless straightforward. We use mersenne-random-pure64 (albeit indirectly via random-source) as our source of entropy.

> runMCMC :: Int -> [(Double, Double)]
> runMCMC n =
>   take n $> drop burnIn$
>   snd > W.runWriter (evalStateT (sample (ML.iterateM_ gibbsSampler (snd initTheta))) (pureMT 2)) We can look at the trajectory of our sampler for various run lengths. For bigger sample sizes, plotting the distribution sampled re-assures us that we are indeed sampling from a bivariate normal distribution as the theory predicted. # Applications to Bayesian Statistics Some of what is here and here excluding JAGS and STAN (after all this is a book about Haskell). Applications to Physics # Applications to Physics Most of what is here. # Gibbs Sampling in R, Haskell, Jags and Stan # Introduction It’s possible to Gibbs sampling in most languages and since I am doing some work in R and some work in Haskell, I thought I’d present a simple example in both languages: estimating the mean from a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or STAN to do the actual sampling and do pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and STAN as well as Haskell. ## Preamble > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-} > {-# LANGUAGE NoMonomorphismRestriction #-} > module Gibbs ( > main > , m > , Moments(..) > ) where > > import qualified Data.Vector.Unboxed as V > import qualified Control.Monad.Loops as ML > import Data.Random.Source.PureMT > import Data.Random > import Control.Monad.State > import Data.Histogram ( asList ) > import Data.Histogram.Fill > import Data.Histogram.Generic ( Histogram ) > import Data.List > import qualified Control.Foldl as L > > import Diagrams.Backend.Cairo.CmdLine > > import LinRegAux > > import Diagrams.Backend.CmdLine > import Diagrams.Prelude hiding ( sample, render ) The length of our chain and the burn-in. > nrep, nb :: Int > nb = 5000 > nrep = 105000 Data generated from ${\cal{N}}(10.0, 5.0)$. > xs :: [Double] > xs = [ > 11.0765808082301 > , 10.918739177542 > , 15.4302462747137 > , 10.1435649220266 > , 15.2112705014697 > , 10.441327659703 > , 2.95784054883142 > , 10.2761068139607 > , 9.64347295100318 > , 11.8043359297675 > , 10.9419989262713 > , 7.21905367667346 > , 10.4339807638017 > , 6.79485294803006 > , 11.817248658832 > , 6.6126710570584 > , 12.6640920214508 > , 8.36604701073303 > , 12.6048485320333 > , 8.43143879537592 > ] # A Bit of Theory ## Gibbs Sampling For a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions. Referring back to the explanation of the metropolis algorithm, let us describe the state by its parameters $i \triangleq \boldsymbol{\theta}^{(i)} \triangleq (\theta^{(i)}_1,\ldots, \theta^{(i)}_n)$ and the conditional posteriors by $\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)$ where ${\boldsymbol{\theta}}^{(i)}_{-k} = \big(\theta_1^{(i)},\ldots,\theta_{k-1}^{(i)},\theta_{k+1}^{(i)}\ldots\theta_n^{(i)}\big)$ then \displaystyle \begin{aligned} \frac{\pi\big(\boldsymbol{\theta}^{(j)}\big)q\big(\boldsymbol{\theta}^{(j)}, \boldsymbol{\theta}^{(i)}\big)} {\pi(\boldsymbol{\theta}^{(i)})q(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(j)})} &= \frac{ \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big) } { \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big) } \\ &= \frac{ \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big) } { \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big) } \\ &= 1 \end{aligned} where we have used the rules of conditional probability and the fact that $\boldsymbol{\theta}_i^{(-k)} = \boldsymbol{\theta}_j^{(-k)}$ Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters. ## Normal Distribution with Unknown Mean and Variance It is fairly standard to use an improper prior \displaystyle \begin{aligned} \pi(\mu, \tau) \propto \frac{1}{\tau} & & -\infty < \mu < \infty\, \textrm{and}\, 0 < \tau < \infty \end{aligned} The likelihood is $\displaystyle p(\boldsymbol{x}\,|\,\mu, \sigma) = \prod_{i=1}^n \bigg(\frac{1}{\sigma\sqrt{2\pi}}\bigg)\exp{\bigg( -\frac{(x_i - \mu)^2}{2\sigma^2}\bigg)}$ re-writing in terms of precision $\displaystyle p(\boldsymbol{x}\,|\,\mu, \tau) \propto \prod_{i=1}^n \sqrt{\tau}\exp{\bigg( -\frac{\tau}{2}{(x_i - \mu)^2}\bigg)} = \tau^{n/2}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}$ Thus the posterior is $\displaystyle p(\mu, \tau \,|\, \boldsymbol{x}) \propto \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}$ We can re-write the sum in terms of the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ using \displaystyle \begin{aligned} \sum_{i=1}^n (x_i - \mu)^2 &= \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 \\ &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2\sum_{i=1}^n (x_i - \bar{x})(\bar{x} - \mu) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\ &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x}) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\ &= (n - 1)s^2 + n(\bar{x} - \mu)^2 \end{aligned} Thus the conditional posterior for $\mu$ is \displaystyle \begin{aligned} p(\mu \,|\, \tau, \boldsymbol{x}) &\propto \exp{\bigg( -\frac{\tau}{2}\bigg(\nu s^2 + \sum_{i=1}^n{(\mu - \bar{x})^2}\bigg)\bigg)} \\ &\propto \exp{\bigg( -\frac{n\tau}{2}{(\mu - \bar{x})^2}\bigg)} \\ \end{aligned} which we recognise as a normal distribution with mean of $\bar{x}$ and a variance of $(n\tau)^{-1}$. The conditional posterior for $\tau$ is \displaystyle \begin{aligned} p(\tau \,|\, , \mu, \boldsymbol{x}) &\propto \tau^{n/2 -1}\exp\bigg(-\tau\frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg) \end{aligned} which we recognise as a gamma distribution with a shape of $n/2$ and a scale of $\frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}$ In this particular case, we can calculate the marginal posterior of $\mu$ analytically. Writing $z = \frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}$ we have \displaystyle \begin{aligned} p(\mu \,|\, \boldsymbol{x}) &= \int_0^\infty p(\mu, \tau \,|\, \boldsymbol{x}) \textrm{d}\tau \\ &\propto \int_0^\infty \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)} \textrm{d}\tau \\ &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \int_0^\infty z^{n/2 - 1}\exp{-z}\textrm{d}\tau \\ &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \\ \end{aligned} Finally we can calculate \displaystyle \begin{aligned} p(\mu \,|\, \boldsymbol{x}) &\propto \bigg( (n - 1)s^2 + n(\bar{x} - \mu)^2 \bigg)^{-n/2} \\ &\propto \bigg( 1 + \frac{n(\mu - \bar{x})^2}{(n - 1)s^2} \bigg)^{-n/2} \\ \end{aligned} This is the non-standardized Student’s t-distribution $t_{n-1}(\bar{x}, s^2/n)$. Alternatively the marginal posterior of $\mu$ is $\displaystyle \frac{\mu - \bar{x}}{s/\sqrt{n}}\bigg|\, x \sim t_{n-1}$ where $t_{n-1}$ is the standard t distribution with $n - 1$ degrees of freedom. # The Model in Haskell Following up on a comment from a previous blog post, let us try using the foldl package to calculate the length, the sum and the sum of squares traversing the list only once. An improvement on creating your own strict record and using foldl’ but maybe it is not suitable for some methods e.g. calculating the skewness and kurtosis incrementally, see below. > x2Sum, xSum, n :: Double > (x2Sum, xSum, n) = L.fold stats xs > where > stats = (,,) <>
>             (L.premap (\x -> x * x) L.sum) <*>
>             L.sum <*>
>             L.genericLength

And re-writing the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ using

\displaystyle \begin{aligned} \sum_{i=1}^n (x_i - \bar{x})^2 &= \sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2) \\ &= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + \sum_{i=1}^n \bar{x}^2 \\ &= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\ &= \sum_{i=1}^n x_i^2 - n\bar{x}^2 \\ \end{aligned}

we can then calculate the sample mean and variance using the sums we have just calculated.

> xBar, varX :: Double
> xBar = xSum / n
> varX = n * (m2Xs - xBar * xBar) / (n - 1)
>   where m2Xs = x2Sum / n

In random-fu, the Gamma distribution is specified by the rate paratmeter, $\beta$.

> beta, initTau :: Double
> beta = 0.5 * n * varX
> initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)

Our sampler takes an old value of $\tau$ and creates new values of $\mu$ and $\tau$.

> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double))
> gibbsSampler oldTau = do
>   newMu <- sample (Normal xBar (recip (sqrt (n * oldTau))))
>   let shape = 0.5 * n
>       scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar)
>   newTau <- sample (Gamma shape (recip scale))