# Introduction

The equation of motion for a pendulum of unit length subject to Gaussian white noise is

$\displaystyle \frac{\mathrm{d}^2\alpha}{\mathrm{d}t^2} = -g\sin\alpha + w(t)$

We can discretize this via the usual Euler method

$\displaystyle \begin{bmatrix} x_{1,i} \\ x_{2,i} \end{bmatrix} = \begin{bmatrix} x_{1,i-1} + x_{2,i-1}\Delta t \\ x_{2,i-1} - g\sin x_{1,i-1}\Delta t \end{bmatrix} + \mathbf{q}_{i-1}$

where $q_i \sim {\mathcal{N}}(0,Q)$ and

$\displaystyle Q = \begin{bmatrix} \frac{q^c \Delta t^3}{3} & \frac{q^c \Delta t^2}{2} \\ \frac{q^c \Delta t^2}{2} & {q^c \Delta t} \end{bmatrix}$

The explanation of the precise form of the covariance matrix will be the subject of another blog post; for the purpose of exposition of forward filtering / backward smoothing, this detail is not important.

Assume that we can only measure the horizontal position of the pendulum and further that this measurement is subject to error so that

$\displaystyle y_i = \sin x_i + r_k$

where $r_i \sim {\mathcal{N}}(0,R)$.

Particle Filtering can give us an estimate of where the pendulum is and its velocity using all the observations up to that point in time. But now suppose we have observed the pendulum for a fixed period of time then at times earlier than the time at which we stop our observations we now have observations in the future as well as in the past. If we can somehow take account of these future observations then we should be able to improve our estimate of where the pendulum was at any given point in time (as well as its velocity). Forward Filtering / Backward Smoothing is a technique for doing this.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}

> {-# LANGUAGE MultiParamTypeClasses        #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE ExplicitForAll               #-}
> {-# LANGUAGE DataKinds                    #-}

> {-# LANGUAGE FlexibleInstances            #-}
> {-# LANGUAGE MultiParamTypeClasses        #-}
> {-# LANGUAGE FlexibleContexts             #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving   #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE DeriveGeneric                #-}

> module PendulumSamples ( pendulumSamples
>                        , pendulumSamples'
>                        , testFiltering
>                        , testSmoothing
>                        , testFilteringG
>                        , testSmoothingG
>                        ) where

> import           Data.Random hiding ( StdNormal, Normal )
> import           Data.Random.Source.PureMT ( pureMT )
> import           Control.Monad.State ( evalState, replicateM )
> import qualified Control.Monad.Loops as ML
> import           Control.Monad.Writer ( tell, WriterT, lift,
>                                         runWriterT
>                                       )
> import           Numeric.LinearAlgebra.Static
>                  ( R, vector, Sym,
>                    diag
>                  )
> import           GHC.TypeLits ( KnownNat )
> import           MultivariateNormal ( MultivariateNormal(..) )
> import qualified Data.Vector as V
> import           Data.Bits ( shiftR )
> import           Data.List ( transpose )
> import           Control.Parallel.Strategies
> import           GHC.Generics (Generic)


# Simulation

Let’s first plot some typical trajectories of the pendulum.

> deltaT, g :: Double
> deltaT = 0.01
> g  = 9.81

> type PendulumState = R 2
> type PendulumObs = R 1

> pendulumSample :: MonadRandom m =>
>                   Sym 2 ->
>                   Sym 1 ->
>                   PendulumState ->
>                   m (Maybe ((PendulumState, PendulumObs), PendulumState))
> pendulumSample bigQ bigR xPrev = do
>   let x1Prev = fst $headTail xPrev > x2Prev = fst$ headTail $snd$ headTail xPrev
>   eta <- sample $rvar (MultivariateNormal 0.0 bigQ) > let x1= x1Prev + x2Prev * deltaT > x2 = x2Prev - g * (sin x1Prev) * deltaT > xNew = vector [x1, x2] + eta > x1New = fst$ headTail xNew
>   epsilon <-  sample $rvar (MultivariateNormal 0.0 bigR) > let yNew = vector [sin x1New] + epsilon > return$ Just ((xNew, yNew), xNew)


Let’s try plotting some samples when we are in the linear region with which we are familiar from school $\sin\alpha \approx \alpha$.

$\displaystyle \frac{\mathrm{d}^2\alpha}{\mathrm{d}t^2} = -g\alpha + w(t)$

In this case we expect the horizontal displacement to be approximately equal to the angle of displacement and thus the observations to be symmetric about the actuals.

> bigQ :: Sym 2
> bigQ = sym $matrix bigQl  > qc1 :: Double > qc1 = 0.0001  > bigQl :: [Double] > bigQl = [ qc1 * deltaT^3 / 3, qc1 * deltaT^2 / 2, > qc1 * deltaT^2 / 2, qc1 * deltaT > ]  > bigR :: Sym 1 > bigR = sym$ matrix [0.0001]

> m0 :: PendulumState
> m0 = vector [0.01, 0]

> pendulumSamples :: [(PendulumState, PendulumObs)]
> pendulumSamples = evalState (ML.unfoldrM (pendulumSample bigQ bigR) m0) (pureMT 17)


But if we work in a region in which linearity breaks down then the observations are no longer symmetrical about the actuals.

> bigQ' :: Sym 2
> bigQ' = sym $matrix bigQl'  > qc1' :: Double > qc1' = 0.01  > bigQl' :: [Double] > bigQl' = [ qc1' * deltaT^3 / 3, qc1' * deltaT^2 / 2, > qc1' * deltaT^2 / 2, qc1' * deltaT > ]  > bigR' :: Sym 1 > bigR' = sym$ matrix [0.1]

> m0' :: PendulumState
> m0' = vector [1.6, 0]

> pendulumSamples' :: [(PendulumState, PendulumObs)]
> pendulumSamples' = evalState (ML.unfoldrM (pendulumSample bigQ' bigR') m0') (pureMT 17)


# Filtering

We do not give the theory behind particle filtering. The interested reader can either consult Särkkä (2013) or wait for a future blog post on the subject.

> nParticles :: Int
> nParticles = 500


The usual Bayesian update step.

> type Particles a = V.Vector a

> oneFilteringStep ::
>   (Particles a -> m (Particles a)) ->
>   (Particles a -> Particles b) ->
>   (b -> b -> Double) ->
>   Particles a ->
>   b ->
>   WriterT [Particles a] m (Particles a)
> oneFilteringStep stateUpdate obsUpdate weight statePrevs obs = do
>   tell [statePrevs]
>   stateNews <- lift $stateUpdate statePrevs > let obsNews = obsUpdate stateNews > let weights = V.map (weight obs) obsNews > cumSumWeights = V.tail$ V.scanl (+) 0 weights
>       totWeight     = V.last cumSumWeights
>   vs <- lift $V.replicateM nParticles (sample$ uniform 0.0 totWeight)
>   let js = indices cumSumWeights vs
>       stateTildes = V.map (stateNews V.!) js
>   return stateTildes


The system state and observable.

> data SystemState a = SystemState { x1  :: a, x2  :: a }
>   deriving (Show, Generic)

> instance NFData a => NFData (SystemState a)

> newtype SystemObs a = SystemObs { y1  :: a }
>   deriving Show


To make the system state update a bit more readable, let’s introduce some lifted arithmetic operators.

> (.+), (.*), (.-) :: (Num a) => V.Vector a -> V.Vector a -> V.Vector a
> (.+) = V.zipWith (+)
> (.*) = V.zipWith (*)
> (.-) = V.zipWith (-)


The state update itself

> stateUpdate :: Particles (SystemState Double) ->
>                 Particles (SystemState Double)
> stateUpdate xPrevs = V.zipWith SystemState x1s x2s
>   where
>     ix = V.length xPrevs
>
>     x1Prevs = V.map x1 xPrevs
>     x2Prevs = V.map x2 xPrevs
>
>     deltaTs = V.replicate ix deltaT
>     gs = V.replicate ix g
>     x1s = x1Prevs .+ (x2Prevs .* deltaTs)
>     x2s = x2Prevs .- (gs .* (V.map sin x1Prevs) .* deltaTs)


and its noisy version.

> stateUpdateNoisy :: MonadRandom m =>
>                     Sym 2 ->
>                     Particles (SystemState Double) ->
>                     m (Particles (SystemState Double))
> stateUpdateNoisy bigQ xPrevs = do
>   let xs = stateUpdate xPrevs
>
>       x1s = V.map x1 xs
>       x2s = V.map x2 xs
>
>   let ix = V.length xPrevs
>   etas <- replicateM ix $sample$ rvar (MultivariateNormal 0.0 bigQ)
>
>   let eta1s, eta2s :: V.Vector Double
>       eta1s = V.fromList $map (fst . headTail) etas > eta2s = V.fromList$ map (fst . headTail . snd . headTail) etas
>
>   return (V.zipWith SystemState (x1s .+ eta1s) (x2s .+ eta2s))


The function which maps the state to the observable.

> obsUpdate :: Particles (SystemState Double) ->
>              Particles (SystemObs Double)
> obsUpdate xs = V.map (SystemObs . sin . x1) xs


And finally a function to calculate the weight of each particle given an observation.

> weight :: forall a n . KnownNat n =>
>           (a -> R n) ->
>           Sym n ->
>           a -> a -> Double
> weight f bigR obs obsNew = pdf (MultivariateNormal (f obsNew) bigR) (f obs)


The variance of the prior.

> bigP :: Sym 2
> bigP = sym $diag 0.1  Generate our ensemble of particles chosen from the prior, > initParticles :: MonadRandom m => > m (Particles (SystemState Double)) > initParticles = V.replicateM nParticles$ do
>   r <- sample $rvar (MultivariateNormal m0' bigP) > let x1 = fst$ headTail r
>       x2 = fst $headTail$ snd $headTail r > return$ SystemState { x1 = x1, x2 = x2}


run the particle filter,

> runFilter :: Int -> [Particles (SystemState Double)]
> runFilter nTimeSteps = snd $evalState action (pureMT 19) > where > action = runWriterT$ do
>       xs <- lift $initParticles > V.foldM > (oneFilteringStep (stateUpdateNoisy bigQ') obsUpdate (weight f bigR')) > xs > (V.fromList$ map (SystemObs . fst . headTail . snd)
>                           (take nTimeSteps pendulumSamples'))


and extract the estimated position from the filter.

> testFiltering :: Int -> [Double]
> testFiltering nTimeSteps = map ((/ (fromIntegral nParticles)). sum . V.map x1)
>                                (runFilter nTimeSteps)


# Smoothing

If we could calculate the marginal smoothing distributions $\{p(x_t \,|\, y_{1:T})\}_{i=1}^T$ then we might be able to sample from them. Using the Markov assumption of our model that $x_i$ is independent of $y_{i+1:N}$ given $x_{i+1}$, we have

\displaystyle \begin{aligned} \overbrace{p(x_i \,|\, y_{1:N})}^{\mathrm{smoother}\,\mathrm{at}\, i} &= \int p(x_i, x_{i+1} \,|\, y_{1:N}) \,\mathrm{d}x_{i+1} & \text{marginal distribution} \\ &= \int p(x_{i+1} \,|\, y_{1:N}) \,p(x_{i} \,|\, y_{1:N}, x_{i+1}) \,\mathrm{d}x_{i+1} & \text{conditional density} \\ &= \int p(x_{i+1} \,|\, y_{1:N}) \,p(x_{i} \,|\, y_{1:i}, x_{i+1}) \,\mathrm{d}x_{i+1} & \text{Markov model} \\ &= \int p(x_{i+1} \,|\, y_{1:N}) \, \frac{p(x_{i}, x_{i+1} \,|\, y_{1:i})}{p(x_{i+1} \,|\, y_{1:i})} \,\mathrm{d}x_{i+1} & \text{conditional density} \\ &= \int p(x_{i+1} \,|\, y_{1:N}) \, \frac{p(x_{i+1} \,|\, x_{i}, y_{1:i}) \,p(x_{i} \,|\, y_{1:i})}{p(x_{i+1} \,|\, y_{1:i})} \,\mathrm{d}x_{i+1} & \text{conditional density} \\ &= \int \overbrace{p(x_{i+1} \,|\, y_{1:N})}^{\text{smoother at }i+1} \, \underbrace{ \overbrace{p(x_{i} \,|\, y_{1:i})}^{\text{filter at }i} \frac{p(x_{i+1} \,|\, x_{i})} {p(x_{i+1} \,|\, y_{1:i})} } _{\text{backward transition }p(x_{i} \,|\, y_{1:i},\,x_{i+1})} \,\mathrm{d}x_{i+1} & \text{Markov model} \end{aligned}

We observe that this is a (continuous state space) Markov process with a non-homogeneous transition function albeit one which goes backwards in time. Apparently for conditionally Gaussian linear state-space models, this is known as RTS, or Rauch-Tung-Striebel smoothing (Rauch, Striebel, and Tung (1965)).

According to Cappé (2008),

• It appears to be efficient and stable in the long term (although no proof was available at the time the slides were presented).

• It is not sequential (in particular, one needs to store all particle positions and weights).

• It has numerical complexity proportional $O(n^2)$ where $N$ is the number of particles.

We can use this to sample paths starting at time $i = N$ and working backwards. From above we have

$\displaystyle p(x_i \,|\, X_{i+1}, Y_{1:N}) = \frac{p(X_{i+1} \,|\, x_{i}) \,p(x_{i} \,|\, Y_{1:i})}{p(X_{i+1} \,|\, Y_{1:i})} = Z \,p(X_{i+1} \,|\, x_{i}) \,p(x_{i} \,|\, Y_{1:i})$

where $Z$ is some normalisation constant (Z for “Zustandssumme” – sum over states).

From particle filtering we know that

$\displaystyle {p}(x_i \,|\, y_{1:i}) \approx \hat{p}(x_i \,|\, y_{1:i}) \triangleq \sum_{j=1}^M w_i^{(j)}\delta(x_i - x_i^{(j)})$

Thus

$\displaystyle \hat{p}(x_i \,|\, X_{i+1}, Y_{1:i}) = Z \,p(X_{i+1} \,|\, x_{i}) \,\hat{p}(x_{i} \,|\, Y_{1:i}) = \sum_{j=1}^M w_i^{(j)}\delta(x_i - x_i^{(j)}) \,p(X_{i+1} \,|\, x_{i})$

and we can sample $x_i$ from $\{x_i^{(j)}\}_{1 \leq j \leq M}$ with probability

$\displaystyle \frac{w_k^{(i)} \,p(X_{i+1} \,|\, x_{i})} {\sum_{i=1}^N w_k^{(i)} \,p(X_{i+1} \,|\, x_{i})}$

Recalling that we have re-sampled the particles in the filtering algorithm so that their weights are all $1/M$ and abstracting the state update and state density function, we can encode this update step in Haskell as

> oneSmoothingStep :: MonadRandom m =>
>          (Particles a -> V.Vector a) ->
>          (a -> a -> Double) ->
>          a ->
>          Particles a ->
>          WriterT (Particles a) m a
> oneSmoothingStep stateUpdate
>                  stateDensity
>                  smoothingSampleAtiPlus1
>                  filterSamplesAti = do it
>   where
>     it = do
>       let mus = stateUpdate filterSamplesAti
>           weights =  V.map (stateDensity smoothingSampleAtiPlus1) mus
>           cumSumWeights = V.tail $V.scanl (+) 0 weights > totWeight = V.last cumSumWeights > v <- lift$ sample $uniform 0.0 totWeight > let ix = binarySearch cumSumWeights v > xnNew = filterSamplesAti V.! ix > tell$ V.singleton xnNew
>       return $xnNew  To sample a complete path we start with a sample from the filtering distribution at at time $i = N$ (which is also the smoothing distribution) > oneSmoothingPath :: MonadRandom m => > (Int -> V.Vector (Particles a)) -> > (a -> Particles a -> WriterT (Particles a) m a) -> > Int -> m (a, V.Vector a) > oneSmoothingPath filterEstss oneSmoothingStep nTimeSteps = do > let ys = filterEstss nTimeSteps > ix <- sample$ uniform 0 (nParticles - 1)
>   let xn = (V.head ys) V.! ix
>   runWriterT $V.foldM oneSmoothingStep xn (V.tail ys)  > oneSmoothingPath' :: (MonadRandom m, Show a) => > (Int -> V.Vector (Particles a)) -> > (a -> Particles a -> WriterT (Particles a) m a) -> > Int -> WriterT (Particles a) m a > oneSmoothingPath' filterEstss oneSmoothingStep nTimeSteps = do > let ys = filterEstss nTimeSteps > ix <- lift$ sample $uniform 0 (nParticles - 1) > let xn = (V.head ys) V.! ix > V.foldM oneSmoothingStep xn (V.tail ys)  Of course we need to run through the filtering distributions starting at $i = N$ > filterEstss :: Int -> V.Vector (Particles (SystemState Double)) > filterEstss n = V.reverse$ V.fromList $runFilter n  > testSmoothing :: Int -> Int -> [Double] > testSmoothing m n = V.toList$ evalState action (pureMT 23)
>   where
>     action = do
>       xss <- V.replicateM n $> snd <$> (oneSmoothingPath filterEstss (oneSmoothingStep stateUpdate (weight h bigQ')) m)
>       let yss = V.fromList $map V.fromList$
>                 transpose $> V.toList$ V.map (V.toList) $> xss > return$ V.map (/ (fromIntegral n)) $V.map V.sum$ V.map (V.map x1) yss


By eye we can see we get a better fit

and calculating the mean square error for filtering gives $1.87\times10^{-2}$ against the mean square error for smoothing of $9.52\times10^{-3}$; this confirms the fit really is better as one would hope.

# Unknown Gravity

Let us continue with the same example but now assume that $g$ is unknown and that we wish to estimate it. Let us also assume that our apparatus is not subject to noise.

Again we have

$\displaystyle \frac{\mathrm{d}^2\alpha}{\mathrm{d}t^2} = -g\sin\alpha + w(t)$

But now when we discretize it we include a third variable

$\displaystyle \begin{bmatrix} x_{1,i} \\ x_{2,i} \\ x_{3,i} \end{bmatrix} = \begin{bmatrix} x_{1,i-1} + x_{2,i-1}\Delta t \\ x_{2,i-1} - x_{3,i-1}\sin x_{1,i-1}\Delta t \\ x_{3,i-1} \end{bmatrix} + \mathbf{q}_{i-1}$

where $q_i \sim {\mathcal{N}}(0,Q)$

$\displaystyle Q = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \sigma^2_g \end{bmatrix}$

Again we assume that we can only measure the horizontal position of the pendulum so that

$\displaystyle y_i = \sin x_i + r_k$

where $r_i \sim {\mathcal{N}}(0,R)$.

> type PendulumStateG = R 3

> pendulumSampleG :: MonadRandom m =>
>                   Sym 3 ->
>                   Sym 1 ->
>                   PendulumStateG ->
>                   m (Maybe ((PendulumStateG, PendulumObs), PendulumStateG))
> pendulumSampleG bigQ bigR xPrev = do
>   let x1Prev = fst $headTail xPrev > x2Prev = fst$ headTail $snd$ headTail xPrev
>       x3Prev = fst $headTail$ snd $headTail$ snd $headTail xPrev > eta <- sample$ rvar (MultivariateNormal 0.0 bigQ)
>   let x1= x1Prev + x2Prev * deltaT
>       x2 = x2Prev - g * (sin x1Prev) * deltaT
>       x3 = x3Prev
>       xNew = vector [x1, x2, x3] + eta
>       x1New = fst $headTail xNew > epsilon <- sample$ rvar (MultivariateNormal 0.0 bigR)
>   let yNew = vector [sin x1New] + epsilon
>   return $Just ((xNew, yNew), xNew)  > pendulumSampleGs :: [(PendulumStateG, PendulumObs)] > pendulumSampleGs = evalState (ML.unfoldrM (pendulumSampleG bigQg bigRg) mG) (pureMT 29)  > data SystemStateG a = SystemStateG { gx1 :: a, gx2 :: a, gx3 :: a } > deriving Show  The state update itself > stateUpdateG :: Particles (SystemStateG Double) -> > Particles (SystemStateG Double) > stateUpdateG xPrevs = V.zipWith3 SystemStateG x1s x2s x3s > where > ix = V.length xPrevs > > x1Prevs = V.map gx1 xPrevs > x2Prevs = V.map gx2 xPrevs > x3Prevs = V.map gx3 xPrevs > > deltaTs = V.replicate ix deltaT > x1s = x1Prevs .+ (x2Prevs .* deltaTs) > x2s = x2Prevs .- (x3Prevs .* (V.map sin x1Prevs) .* deltaTs) > x3s = x3Prevs  and its noisy version. > stateUpdateNoisyG :: MonadRandom m => > Sym 3 -> > Particles (SystemStateG Double) -> > m (Particles (SystemStateG Double)) > stateUpdateNoisyG bigQ xPrevs = do > let ix = V.length xPrevs > > let xs = stateUpdateG xPrevs > > x1s = V.map gx1 xs > x2s = V.map gx2 xs > x3s = V.map gx3 xs > > etas <- replicateM ix$ sample $rvar (MultivariateNormal 0.0 bigQ) > let eta1s, eta2s, eta3s :: V.Vector Double > eta1s = V.fromList$ map (fst . headTail) etas
>       eta2s = V.fromList $map (fst . headTail . snd . headTail) etas > eta3s = V.fromList$ map (fst . headTail . snd . headTail . snd . headTail) etas
>
>   return (V.zipWith3 SystemStateG (x1s .+ eta1s) (x2s .+ eta2s) (x3s .+ eta3s))


The function which maps the state to the observable.

> obsUpdateG :: Particles (SystemStateG Double) ->
>              Particles (SystemObs Double)
> obsUpdateG xs = V.map (SystemObs . sin . gx1) xs


The mean and variance of the prior.

> mG :: R 3
> mG = vector [1.6, 0.0, 8.00]

> bigPg :: Sym 3
> bigPg = sym $matrix [ > 1e-6, 0.0, 0.0 > , 0.0, 1e-6, 0.0 > , 0.0, 0.0, 1e-2 > ]  Parameters for the state update; note that the variance is not exactly the same as in the formulation above. > bigQg :: Sym 3 > bigQg = sym$ matrix bigQgl

> qc1G :: Double
> qc1G = 0.0001

> sigmaG :: Double
> sigmaG = 1.0e-2

> bigQgl :: [Double]
> bigQgl = [ qc1G * deltaT^3 / 3, qc1G * deltaT^2 / 2, 0.0,
>            qc1G * deltaT^2 / 2,       qc1G * deltaT, 0.0,
>                            0.0,                 0.0, sigmaG
>          ]


The noise of the measurement.

> bigRg :: Sym 1
> bigRg  = sym $matrix [0.1]  Generate the ensemble of particles from the prior, > initParticlesG :: MonadRandom m => > m (Particles (SystemStateG Double)) > initParticlesG = V.replicateM nParticles$ do
>   r <- sample $rvar (MultivariateNormal mG bigPg) > let x1 = fst$ headTail r
>       x2 = fst $headTail$ snd $headTail r > x3 = fst$ headTail $snd$ headTail $snd$ headTail r
>   return $SystemStateG { gx1 = x1, gx2 = x2, gx3 = x3}  run the particle filter, > runFilterG :: Int -> [Particles (SystemStateG Double)] > runFilterG n = snd$ evalState action (pureMT 19)
>   where
>     action = runWriterT $do > xs <- lift$ initParticlesG
>       V.foldM
>         (oneFilteringStep (stateUpdateNoisyG bigQg) obsUpdateG (weight f bigRg))
>         xs
>         (V.fromList $map (SystemObs . fst . headTail . snd) (take n pendulumSampleGs))  and extract the estimated parameter from the filter. > testFilteringG :: Int -> [Double] > testFilteringG n = map ((/ (fromIntegral nParticles)). sum . V.map gx3) (runFilterG n)  Again we need to run through the filtering distributions starting at $i = N$ > filterGEstss :: Int -> V.Vector (Particles (SystemStateG Double)) > filterGEstss n = V.reverse$ V.fromList $runFilterG n  > testSmoothingG :: Int -> Int -> ([Double], [Double], [Double]) > testSmoothingG m n = (\(x, y, z) -> (V.toList x, V.toList y, V.toList z))$
>                      mkMeans $> chunks > where > > chunks = > V.fromList$ map V.fromList $> transpose$
>       V.toList $V.map (V.toList)$
>       chunksOf m $> snd$ evalState (runWriterT action) (pureMT 23)
>
>     mkMeans yss = (
>       V.map (/ (fromIntegral n)) $V.map V.sum$ V.map (V.map gx1) yss,
>       V.map (/ (fromIntegral n)) $V.map V.sum$ V.map (V.map gx2) yss,
>       V.map (/ (fromIntegral n)) $V.map V.sum$ V.map (V.map gx3) yss
>       )
>
>     action =
>       V.replicateM n > oneSmoothingPath' filterGEstss > (oneSmoothingStep stateUpdateG (weight hG bigQg)) > m  Again by eye we get a better fit but note that we are using the samples in which the state update is noisy as well as the observation so we don’t expect to get a very good fit. # Notes ## Helpers for Converting Types > f :: SystemObs Double -> R 1 > f = vector . pure . y1  > h :: SystemState Double -> R 2 > h u = vector [x1 u , x2 u]  > hG :: SystemStateG Double -> R 3 > hG u = vector [gx1 u , gx2 u, gx3 u]  ## Helpers for the Inverse CDF That these are helpers for the inverse CDF is delayed to another blog post. > indices :: V.Vector Double -> V.Vector Double -> V.Vector Int > indices bs xs = V.map (binarySearch bs) xs  > binarySearch :: (Ord a) => > V.Vector a -> a -> Int > binarySearch vec x = loop 0 (V.length vec - 1) > where > loop !l !u > | u <= l = l > | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u > where k = l + (u - l) shiftR 1  ## Vector Helpers > chunksOf :: Int -> V.Vector a -> V.Vector (V.Vector a) > chunksOf n xs = ys > where > l = V.length xs > m = 1 + (l - 1) div n > ys = V.unfoldrN m (\us -> Just (V.take n us, V.drop n us)) xs  # Bibliography Cappé, Olivier. 2008. “An Introduction to Sequential Monte Carlo for Filtering and Smoothing.” http://www-irma.u-strasbg.fr/~guillou/meeting/cappe.pdf. Rauch, H. E., C. T. Striebel, and F. Tung. 1965. “Maximum Likelihood Estimates of Linear Dynamic Systems.” Journal of the American Institute of Aeronautics and Astronautics 3 (8): 1445–50. Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press. # Naive Particle Smoothing is Degenerate # Introduction Let $\{X_t\}_{t \geq 1}$ be a (hidden) Markov process. By hidden, we mean that we are not able to observe it. $\displaystyle X_1 \sim \mu(\centerdot) \quad X_t \,|\, (X_{t-1} = x) \sim f(\centerdot \,|\, x)$ And let $\{Y_t\}_{t \geq 1}$ be an observable Markov process such that $\displaystyle Y_t \,|\, (X_{t} = x) \sim g(\centerdot \,|\, x)$ That is the observations are conditionally independent given the state of the hidden process. As an example let us take the one given in Särkkä (2013) where the movement of a car is given by Newton’s laws of motion and the acceleration is modelled as white noise. \displaystyle \begin{aligned} X_t &= AX_{t-1} + Q \epsilon_t \\ Y_t &= HX_t + R \eta_t \end{aligned} Although we do not do so here, $A, Q, H$ and $R$ can be derived from the dynamics. For the purpose of this blog post, we note that they are given by $\displaystyle A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} , \quad Q = \begin{bmatrix} \frac{q_1^c \Delta t^3}{3} & 0 & \frac{q_1^c \Delta t^2}{2} & 0 \\ 0 & \frac{q_2^c \Delta t^3}{3} & 0 & \frac{q_2^c \Delta t^2}{2} \\ \frac{q_1^c \Delta t^2}{2} & 0 & {q_1^c \Delta t} & 0 \\ 0 & \frac{q_2^c \Delta t^2}{2} & 0 & {q_2^c \Delta t} \end{bmatrix}$ and $\displaystyle H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} , \quad R = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$ We wish to determine the position and velocity of the car given noisy observations of the position. In general we need the distribution of the hidden path given the observable path. We use the notation $x_{m:n}$ to mean the path of $x$ starting a $m$ and finishing at $n$. $\displaystyle p(x_{1:n} \,|\, y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})}$ ## Haskell Preamble > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-}  > {-# LANGUAGE FlexibleInstances #-} > {-# LANGUAGE MultiParamTypeClasses #-} > {-# LANGUAGE FlexibleContexts #-} > {-# LANGUAGE TypeFamilies #-} > {-# LANGUAGE BangPatterns #-} > {-# LANGUAGE GeneralizedNewtypeDeriving #-} > {-# LANGUAGE ScopedTypeVariables #-} > {-# LANGUAGE TemplateHaskell #-}  > module ParticleSmoothing > ( simpleSamples > , carSamples > , testCar > , testSimple > ) where  > import Data.Random.Source.PureMT > import Data.Random hiding ( StdNormal, Normal ) > import qualified Data.Random as R > import Control.Monad.State > import Control.Monad.Writer hiding ( Any, All ) > import qualified Numeric.LinearAlgebra.HMatrix as H > import Foreign.Storable ( Storable ) > import Data.Maybe ( fromJust ) > import Data.Bits ( shiftR ) > import qualified Data.Vector as V > import qualified Data.Vector.Unboxed as U > import Control.Monad.ST > import System.Random.MWC  > import Data.Array.Repa ( Z(..), (:.)(..), Any(..), computeP, > extent, DIM1, DIM2, slice, All(..) > ) > import qualified Data.Array.Repa as Repa  > import qualified Control.Monad.Loops as ML  > import PrettyPrint () > import Text.PrettyPrint.HughesPJClass ( Pretty, pPrint )  > import Data.Vector.Unboxed.Deriving  # Some Theory If we could sample $X_{1:n}^{(i)} \sim p(x_{1:n} \,|\, y_{1:n})$ then we could approximate the posterior as $\displaystyle \hat{p}(x_{1:n} \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n}^{(i)}}(x_{1:n})$ If we wish to, we can create marginal estimates $\displaystyle \hat{p}(x_k \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{k}^{(i)}}(x_{k})$ When $k = N$, this is the filtering estimate. ## Standard Bayesian Recursion Prediction \displaystyle \begin{aligned} p(x_n \,|\, y_{1:n-1}) &= \int p(x_{n-1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ &= \int p(x_{n} \,|\, x_{n-1}, y_{1:n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ &= \int f(x_{n} \,|\, x_{n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ \end{aligned} Update \displaystyle \begin{aligned} p(x_n \,|\, y_{1:n}) &= \frac{p(y_n \,|\, x_n, y_{1:n-1}) \, p(x_n \,|\, y_{1:n-1})} {p(y_n \,|\, y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})} {p(y_n \,|\, y_{1:n-1})} \end{aligned} where by definition $\displaystyle {p(y_n \,|\, y_{1:n-1})} = \int {g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})} \,\mathrm{d}x_n$ ## Path Space Recursion We have \displaystyle \begin{aligned} p(x_{1:n} \,|\, y_{1:n}) &= \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})} \\ &= \frac{p(x_n, y_n \,|\, x_{1:n-1}, y_{1:n-1})}{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{p(y_n \,|\, x_{1:n}, y_{1:n-1}) \, p(x_n \,|\, x_{1:n-1}, y_{1:n-1}) }{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, \frac{p(x_{1:n-1}, y_{1:n-1})}{ \, p(y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, {p(x_{1:n-1} \,|\,y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \overbrace{f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}}^{\mathrm{predictive}\,p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})} \\ \end{aligned} where by definition $\displaystyle p(y_n \,|\, y_{1:n-1}) = \int g(y_n \,|\, x_n) \, p(x_{1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{1:n}$ Prediction $\displaystyle p(x_{1:n} \,|\, y_{1:n-1}) = f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}$ Update $\displaystyle p(x_{1:n} \,|\, y_{1:n}) = \frac{g(y_n \,|\, x_{n}) \, {p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})}$ ## Algorithm The idea is to simulate paths using the recursion we derived above. At time $n-1$ we have an approximating distribution $\displaystyle \hat{p}(x_{1:n-1} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n-1}}^{(i)}(x_{1:n-1})$ Sample $\tilde{X}_n^{(i)} \sim f(\centerdot \,|\, X_{n-1}^{(i)})$ and set $\tilde{X}_{1:n}^{(i)} = (\tilde{X}_{1:n-1}^{(i)}, \tilde{X}_n^{(i)})$. We then have an approximation of the prediction step $\displaystyle \hat{p}(x_{1:n} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})$ Substituting \displaystyle \begin{aligned} {\hat{p}(y_n \,|\, y_{1:n-1})} &= \int {g(y_n \,|\, x_n) \, \hat{p}(x_{1:n} \,|\, y_{1:n-1})} \,\mathrm{d}x_n \\ &= \int {g(y_n \,|\, x_n)}\frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n-1}}^{(i)}(x_{1:n}) \,\mathrm{d}x_n \\ &= \frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})} \end{aligned} and again \displaystyle \begin{aligned} \tilde{p}(x_{1:n} \,|\, y_{1:n}) &= \frac{g(y_n \,|\, x_{n}) \, {\hat{p}(x_{1:n} \,|\, y_{1:n-1})}} {\hat{p}(y_n \,|\, y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})} {\frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \frac{ \sum_{i=1}^N g(y_n \,|\, \tilde{X}_n^{(i)}) \, \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})} {\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \sum_{i=1}^N W_n^{(i)} \delta_{\tilde{X}_{1:n}^{(i)}} (x_{1:n}) \end{aligned} where $W_n^{(i)} \propto g(y_n \,|\, \tilde{X}_n^{(i)})$ and $\sum_{i=1}^N W_n^{(i)} = 1$. Now sample $\displaystyle X_{1:n}^{(i)} \sim \tilde{p}(x_{1:n} \,|\, y_{1:n})$ # A Haskell Implementation Let’s specify some values for the example of the car moving in two dimensions. > deltaT, sigma1, sigma2, qc1, qc2 :: Double > deltaT = 0.1 > sigma1 = 1/2 > sigma2 = 1/2 > qc1 = 1 > qc2 = 1  > bigA :: H.Matrix Double > bigA = (4 H.>< 4) bigAl  > bigAl :: [Double] > bigAl = [1, 0 , deltaT, 0, > 0, 1, 0, deltaT, > 0, 0, 1, 0, > 0, 0, 0, 1]  > bigQ :: H.Herm Double > bigQ = H.trustSym (4 H.>< 4) bigQl

> bigQl :: [Double]
> bigQl = [qc1 * deltaT^3 / 3,                  0, qc1 * deltaT^2 / 2,                  0,
>                           0, qc2 * deltaT^3 / 3,                  0, qc2 * deltaT^2 / 2,
>          qc1 * deltaT^2 / 2,                  0,       qc1 * deltaT,                  0,
>                           0, qc2 * deltaT^2 / 2,                  0,       qc2 * deltaT]

> bigH :: H.Matrix Double
> bigH = (2 H.>< 4) [1, 0, 0, 0,
>                    0, 1, 0, 0]

> bigR :: H.Herm Double
> bigR = H.trustSym $(2 H.>< 2) [sigma1^2, 0, > 0, sigma2^2]  > m0 :: H.Vector Double > m0 = H.fromList [0, 0, 1, -1]  > bigP0 :: H.Herm Double > bigP0 = H.trustSym$ H.ident 4

> n :: Int
> n = 23


With these we generate hidden and observable sample path.

> carSample :: MonadRandom m =>
>              H.Vector Double ->
>              m (Maybe ((H.Vector Double, H.Vector Double), H.Vector Double))
> carSample xPrev = do
>   xNew <- sample $rvar (Normal (bigA H.#> xPrev) bigQ) > yNew <- sample$ rvar (Normal (bigH H.#> xNew) bigR)
>   return $Just ((xNew, yNew), xNew)  > carSamples :: [(H.Vector Double, H.Vector Double)] > carSamples = evalState (ML.unfoldrM carSample m0) (pureMT 17)  We can plot an example trajectory for the car and the noisy observations that are available to the smoother / filter. Sadly there is no equivalent to numpy in Haskell. Random number packages generate vectors, for multi-rank arrays there is repa and for fast matrix manipulation there is hmtatrix. Thus for our single step path update function, we have to pass in functions to perform type conversion. Clearly with all the copying inherent in this approach, performance is not going to be great. The type synonym ArraySmoothing is used to denote the cloud of particles at each time step. > type ArraySmoothing = Repa.Array Repa.U DIM2  > singleStep :: forall a . U.Unbox a => > (a -> H.Vector Double) -> > (H.Vector Double -> a) -> > H.Matrix Double -> > H.Herm Double -> > H.Matrix Double -> > H.Herm Double -> > ArraySmoothing a -> H.Vector Double -> > WriterT [ArraySmoothing a] (StateT PureMT IO) (ArraySmoothing a) > singleStep f g bigA bigQ bigH bigR x y = do > tell[x] > let (Z :. ix :. jx) = extent x > > xHatR <- lift$ computeP $Repa.slice x (Any :. jx - 1) > let xHatH = map f$ Repa.toList (xHatR  :: Repa.Array Repa.U DIM1 a)
>   xTildeNextH <- lift $mapM (\x -> sample$ rvar (Normal (bigA H.#> x) bigQ)) xHatH
>
>   let xTildeNextR = Repa.fromListUnboxed (Z :. ix :. (1 :: Int)) $> map g xTildeNextH > xTilde = Repa.append x xTildeNextR > > weights = map (normalPdf y bigR)$
>                 map (bigH H.#>) xTildeNextH
>       vs = runST (create >>= (asGenST $\gen -> uniformVector gen n)) > cumSumWeights = V.scanl (+) 0 (V.fromList weights) > totWeight = sum weights > js = indices (V.map (/ totWeight)$ V.tail cumSumWeights) vs
>       xNewV = V.map (\j -> Repa.transpose $> Repa.reshape (Z :. (1 :: Int) :. jx + 1)$
>                            slice xTilde (Any :. j :. All)) js
>       xNewR = Repa.transpose $V.foldr Repa.append (xNewV V.! 0) (V.tail xNewV) > computeP xNewR  The state for the car is a 4-tuple. > data SystemState a = SystemState { xPos :: a > , yPos :: a > , _xSpd :: a > , _ySpd :: a > }  We initialize the smoother from some prior distribution. > initCar :: StateT PureMT IO (ArraySmoothing (SystemState Double)) > initCar = do > xTilde1 <- replicateM n$ sample $rvar (Normal m0 bigP0) > let weights = map (normalPdf (snd$ head carSamples) bigR) $> map (bigH H.#>) xTilde1 > vs = runST (create >>= (asGenST$ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $> map ((\[a,b,c,d] -> SystemState a b c d) . H.toList)$
>               V.toList $> V.map ((V.fromList xTilde1) V.!) js > return xHat1  Now we can run the smoother. > smootherCar :: StateT PureMT IO > (ArraySmoothing (SystemState Double) > , [ArraySmoothing (SystemState Double)]) > smootherCar = runWriterT$ do
>   xHat1 <- lift initCar
>   foldM (singleStep f g bigA bigQ bigH bigR) xHat1 (take 100 $map snd$ tail carSamples)

> f :: SystemState Double -> H.Vector Double
> f (SystemState a b c d) = H.fromList [a, b, c, d]

> g :: H.Vector Double -> SystemState Double
> g = (\[a,b,c,d] -> (SystemState a b c d)) . H.toList


And create inferred positions for the car which we then plot.

> testCar :: IO ([Double], [Double])
> testCar = do
>   states <- snd <$> evalStateT smootherCar (pureMT 24) > let xs :: [Repa.Array Repa.D DIM2 Double] > xs = map (Repa.map xPos) states > sumXs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose xs) > let ixs = map extent sumXs > sumLastXs = map (* (recip$ fromIntegral n)) $> zipWith (Repa.!) sumXs (map (\(Z :. x) -> Z :. (x - 1)) ixs) > let ys :: [Repa.Array Repa.D DIM2 Double] > ys = map (Repa.map yPos) states > sumYs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose ys) > let ixsY = map extent sumYs > sumLastYs = map (* (recip$ fromIntegral n)) $> zipWith (Repa.!) sumYs (map (\(Z :. x) -> Z :. (x - 1)) ixsY) > return (sumLastXs, sumLastYs)  So it seems our smoother does quite well at inferring the state at the latest observation, that is, when it is working as a filter. But what about estimates for earlier times? We should do better as we have observations in the past and in the future. Let’s try with a simpler example and a smaller number of particles. First we create some samples for our simple 1 dimensional linear Gaussian model. > bigA1, bigQ1, bigR1, bigH1 :: Double > bigA1 = 0.5 > bigQ1 = 0.1 > bigR1 = 0.1 > bigH1 = 1.0  > simpleSample :: MonadRandom m => > Double -> > m (Maybe ((Double, Double), Double)) > simpleSample xPrev = do > xNew <- sample$ rvar (R.Normal (bigA1 * xPrev) bigQ1)
>   yNew <- sample $rvar (R.Normal (bigH1 * xNew) bigR1) > return$ Just ((xNew, yNew), xNew)

> simpleSamples :: [(Double, Double)]
> simpleSamples = evalState (ML.unfoldrM simpleSample 0.0) (pureMT 17)


Again create a prior.

> initSimple :: MonadRandom m => m (ArraySmoothing Double)
> initSimple = do
>   let y = snd $head simpleSamples > xTilde1 <- replicateM n$ sample $rvar$ R.Normal y bigR1
>   let weights = map (pdf (R.Normal y bigR1)) $> map (bigH1 *) xTilde1 > totWeight = sum weights > vs = runST (create >>= (asGenST$ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList $map (/ totWeight) weights) > js = indices (V.tail cumSumWeights) vs > xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int))$
>               V.toList $> V.map ((V.fromList xTilde1) V.!) js > return xHat1  Now we can run the smoother. > smootherSimple :: StateT PureMT IO (ArraySmoothing Double, [ArraySmoothing Double]) > smootherSimple = runWriterT$ do
>   xHat1 <- lift initSimple
>   foldM (singleStep f1 g1 ((1 H.>< 1) [bigA1]) (H.trustSym $(1 H.>< 1) [bigQ1^2]) > ((1 H.>< 1) [bigH1]) (H.trustSym$ (1 H.>< 1) [bigR1^2]))
>         xHat1
>         (take 20 $map H.fromList$ map return . map snd $tail simpleSamples)  > f1 :: Double -> H.Vector Double > f1 a = H.fromList [a]  > g1 :: H.Vector Double -> Double > g1 = (\[a] -> a) . H.toList  And finally we can look at the paths not just the means of the marginal distributions at the latest observation time. > testSimple :: IO [[Double]] > testSimple = do > states <- snd <$> evalStateT smootherSimple (pureMT 24)
>   let path :: Int -> IO (Repa.Array Repa.U DIM1 Double)
>       path i = computeP $Repa.slice (last states) (Any :. i :. All) > paths <- mapM path [0..n - 1] > return$ map Repa.toList paths


We can see that at some point in the past all the current particles have one ancestor. The marginals of the smoothing distribution (at some point in the past) have collapsed on to one particle.

# Notes

## Helpers for the Inverse CDF

That these are helpers for the inverse CDF is delayed to another blog post.

> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs

> binarySearch :: Ord a =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) shiftR 1


## Multivariate Normal

The random-fu package does not contain a sampler or pdf for a multivariate normal so we create our own. This should be added to random-fu-multivariate package or something similar.

> normalMultivariate :: H.Vector Double -> H.Herm Double -> RVarT m (H.Vector Double)
> normalMultivariate mu bigSigma = do
>   z <- replicateM (H.size mu) (rvarT R.StdNormal)
>   return $mu + bigA H.#> (H.fromList z) > where > (vals, bigU) = H.eigSH bigSigma > lSqrt = H.diag$ H.cmap sqrt vals
>     bigA = bigU H.<> lSqrt

> data family Normal k :: *

> data instance Normal (H.Vector Double) = Normal (H.Vector Double) (H.Herm Double)

> instance Distribution Normal (H.Vector Double) where
>   rvar (Normal m s) = normalMultivariate m s

> normalPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
>              H.Vector a -> H.Herm a -> H.Vector a -> a
> normalPdf mu sigma x = exp $normalLogPdf mu sigma x  > normalLogPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) => > H.Vector a -> H.Herm a -> H.Vector a -> a > normalLogPdf mu bigSigma x = - H.sumElements (H.cmap log (diagonals dec)) > - 0.5 * (fromIntegral (H.size mu)) * log (2 * pi) > - 0.5 * s > where > dec = fromJust$ H.mbChol bigSigma
>     t = fromJust $H.linearSolve (H.tr dec) (H.asColumn$ x - mu)
>     u = H.cmap (\x -> x * x) t
>     s = H.sumElements u

> diagonals :: (Storable a, H.Element t, H.Indexable (H.Vector t) a) =>
>              H.Matrix t -> H.Vector a
> diagonals m = H.fromList (map (\i -> m H.! i H.! i) [0..n-1])
>   where
>     n = max (H.rows m) (H.cols m)

> instance PDF Normal (H.Vector Double) where
>   pdf (Normal m s) = normalPdf m s
>   logPdf (Normal m s) = normalLogPdf m s


## Misellaneous

> derivingUnbox "SystemState"
>     [t| forall a . (U.Unbox a) => SystemState a -> (a, a, a, a) |]
>     [| \ (SystemState x y xdot ydot) -> (x, y, xdot, ydot) |]
>     [| \ (x, y, xdot, ydot) -> SystemState x y xdot ydot |]

> instance Pretty a => Pretty (SystemState a) where
>   pPrint (SystemState x y xdot ydot ) = pPrint (x, y, xdot, ydot)


# Bibliography

Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press.

# Floating Point: A Faustian Bargain?

Every so often, someone bitten by floating point arithmetic behaving in an unexpected way is tempted to suggest that a calculation should be done be precisely and rounding done at the end. With floating point rounding is done at every step.

Here’s an example of why floating point might really be the best option for numerical calculations.

Suppose you wish to find the roots of a quintic equation.

> import Numeric.AD
> import Data.List
> import Data.Ratio

> p :: Num a => a -> a
> p x = x^5 - 2*x^4 - 3*x^3 + 3*x^2 - 2*x - 1


We can do so using Newton-Raphson using automatic differentiation to calculate the derivative (even though for polynomials this is trivial).

> nr :: Fractional a => [a]
> nr = unfoldr g 0
>   where
>     g z = let u = z - (p z) / (h z) in Just (u, u)
>     h z = let [y] = grad (\[x] -> p x) [z] in y


After 7 iterations we see the size of the denominator is quite large (33308 digits) and the calculation takes many seconds.

ghci> length $show$ denominator (nr!!7)
33308


On the other hand if we use floating point we get an answer accurate to 1 in $2^{53}$ after 7 iterations very quickly.

ghci> mapM_ putStrLn $map show$ take 7 nr
-0.5
-0.3368421052631579
-0.31572844839628944
-0.31530116270327685
-0.31530098645936266
-0.3153009864593327
-0.3153009864593327


The example is taken from here who refers the reader to Nick Higham’s book: Accuracy and Stability of Numerical Algorithms.

Of course we should check we found a right answer.

ghci> p nr!!6 0.0  # Population Growth Estimation via Markov Chain Monte Carlo # Introduction Let us see if we can estimate the parameter for population growth using MCMC in the example in which we used Kalman filtering. We recall the model. \displaystyle \begin{aligned} \dot{p} & = rp\Big(1 - \frac{p}{k}\Big) \end{aligned} $\displaystyle p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}$ And we are allowed to sample at regular intervals \displaystyle \begin{aligned} p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned} In other words $y_i \sim {\cal{N}}(p_i, \sigma^2)$, where $\sigma$ is known so the likelihood is $\displaystyle p(y\,|\,r) \propto \prod_{i=1}^n \exp{\bigg( -\frac{(y_i - p_i)^2}{2\sigma^2}\bigg)} = \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}$ Let us assume a prior of $r \sim {\cal{N}}(\mu_0,\sigma_0^2)$ then the posterior becomes $\displaystyle p(r\,|\,y) \propto \exp{\bigg( -\frac{(r - \mu_0)^2}{2\sigma_0^2} \bigg)} \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}$ ## Preamble > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-}  > {-# LANGUAGE NoMonomorphismRestriction #-}  > module PopGrowthMCMC where > > import qualified Data.Vector.Unboxed as V > import Data.Random.Source.PureMT > import Data.Random > import Control.Monad.State > import Data.Histogram.Fill > import Data.Histogram.Generic ( Histogram )  # Implementation We assume most of the parameters are known with the exception of the the growth rate $r$. We fix this also in order to generate test data. > k, p0 :: Double > k = 1.0 > p0 = 0.1  > r, deltaT :: Double > r = 10.0 > deltaT = 0.0005  > nObs :: Int > nObs = 150  Here’s the implementation of the logistic function > logit :: Double -> Double -> Double -> Double > logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))  Let us create some noisy data. > sigma :: Double > sigma = 1e-2  > samples :: [Double] > samples = zipWith (+) mus epsilons > where > mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..]) > epsilons = evalState (sample replicateM nObs $rvar (Normal 0.0 sigma)) (pureMT 3)  Arbitrarily let us set the prior to have a rather vague normal distribution. > mu0, sigma0 :: Double > mu0 = 5.0 > sigma0 = 1e1  > prior :: Double -> Double > prior r = exp (-(r - mu0)**2 / (2 * sigma0**2))  > likelihood :: Double -> [Double] -> Double > likelihood r ys = exp (-sum (zipWith (\y mu -> (y - mu)**2 / (2 * sigma**2)) ys mus)) > where > mus :: [Double] > mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])  > posterior :: Double -> [Double] -> Double > posterior r ys = likelihood r ys * prior r  The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go. > acceptanceProb' :: Double -> Double -> [Double] -> Double > acceptanceProb' r r' ys = min 1.0 ((posterior r' ys) / (posterior r ys))  > oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int) > oneStep (r, nAccs) (proposedJump, acceptOrReject) = > if acceptOrReject < acceptanceProb' r (r + proposedJump) samples > then (r + proposedJump, nAccs + 1) > else (r, nAccs)  Here are our proposals. > normalisedProposals :: Int -> Double -> Int -> [Double] > normalisedProposals seed sigma nIters = > evalState (replicateM nIters (sample (Normal 0.0 sigma))) > (pureMT$ fromIntegral seed)


We also need samples from the uniform distribution

> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
>   evalState (replicateM nIters (sample stdUniform))
>   (pureMT $fromIntegral seed)  Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning. > nIters, burnIn :: Int > nIters = 100000 > burnIn = nIters div 10  Let us start our chain to the mean of the prior. In theory this shoudn’t matter as by the time we have burnt in we should be sampling in the high density region of the distribution. > startMu :: Double > startMu = 5.0  This jump size should allow us to sample the region of high density at a reasonable granularity. > jumpVar :: Double > jumpVar = 0.01  Now we can test our MCMC implementation. > test :: Int -> [(Double, Int)] > test seed = > drop burnIn$
>   scanl oneStep (startMu, 0) $> zip (normalisedProposals seed jumpVar nIters) > (acceptOrRejects seed nIters)  We put the data into a histogram. > numBins :: Int > numBins = 400  > hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double) > hb = forceDouble -<< mkSimple (binD lower numBins upper) > where > lower = r - 0.5 * sigma0 > upper = r + 0.5 * sigma0 > > hist :: Int -> Histogram V.Vector BinD Double > hist seed = fillBuilder hb (map fst$ test seed)


With 50 observations we don’t seem to be very certain about the growth rate.

With 100 observations we do very much better.

And with 150 observations we do even better.

# Thames Flux

It is roughly 150 miles from the source of the Thames to Kingston Bridge. If we assume that it flows at about 2 miles per hour then the water at Thames Head will have reached Kingston very roughly at $\frac{150}{24\times 2} \approxeq 3$ days.

The Environmental Agency measure the flux at Kingston Bridge on a twice daily basis. Can we predict this? In the first instance without any other data and using our observation that Thames flushes itself every 3 days, let us try

$\displaystyle X_t = \theta_1 X_{t-1} + \theta_2 X_{t-2} + \theta_3 X_{t-3} + \epsilon_t$

where $X_t$ is the flux on day $t$ and $\{\epsilon_t\}_{t \in \mathbb{N}}$ are independent normal errors with mean $0$ and variance some given value $\sigma^2$.

# Kalman

As it stands, our model is not Markov so we cannot directly apply techniques such as Kalman filtering or particle filtering to estimate the parameters. However we can re-write the model as

$\displaystyle \begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \theta_3^{(t)} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^{(t-1)} \\ \theta_2^{(t-1)} \\ \theta_3^{(t-1)} \end{bmatrix} + \begin{bmatrix} \eta_{t} \\ \eta_{t} \\ \eta_{t} \end{bmatrix}$

$\displaystyle y_t = \begin{bmatrix} x_{t-1} & x_{t-2} & x_{t-3} \end{bmatrix} \begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \theta_3^{(t)} \end{bmatrix} + \epsilon_{t}$

Note that the observation map now varies over time so we have modify our Kalman filter implementation to accept a different matrix at each step.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}

> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE TypeFamilies                 #-}

> module Autoregression (
>     predictions
>   ) where

> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )

> import qualified Data.Vector as V

> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $linSolve m eye  > outer :: forall m n . (KnownNat m, KnownNat n, > (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) => > R n -> Sq n -> [L m n] -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)] > outer muPrior sigmaPrior bigHs bigSigmaY bigA bigSigmaX ys = result > where > result = scanl update (muPrior, sigmaPrior) (zip ys bigHs) > > update :: (R n, Sq n) -> (R m, L m n) -> (R n, Sq n) > update (xHatFlat, bigSigmaHatFlat) (y, bigH) = > (xHatFlatNew, bigSigmaHatFlatNew) > where > v :: R m > v = y - bigH #> xHatFlat > bigS :: Sq m > bigS = bigH bigSigmaHatFlat (tr bigH) + bigSigmaY > bigK :: L n m > bigK = bigSigmaHatFlat (tr bigH) (inv bigS) > xHat :: R n > xHat = xHatFlat + bigK #> v > bigSigmaHat :: Sq n > bigSigmaHat = bigSigmaHatFlat - bigK bigS (tr bigK) > xHatFlatNew :: R n > xHatFlatNew = bigA #> xHat > bigSigmaHatFlatNew :: Sq n > bigSigmaHatFlatNew = bigA bigSigmaHat (tr bigA) + bigSigmaX  We can now set up the parameters to run the filter. > stateVariance :: Double > stateVariance = 1e-8  > bigSigmaX :: Sq 3 > bigSigmaX = fromList [ stateVariance, 0.0, 0.0 > , 0.0, stateVariance, 0.0 > , 0.0, 0.0, stateVariance > ]  > bigA :: Sq 3 > bigA = eye  > muPrior :: R 3 > muPrior = fromList [0.0, 0.0, 0.0]  > sigmaPrior :: Sq 3 > sigmaPrior = fromList [ 1e1, 0.0, 0.0 > , 0.0, 1e1, 0.0 > , 0.0, 0.0, 1e1 > ]  > bigHsBuilder :: V.Vector Double -> [L 1 3] > bigHsBuilder flows = > V.toList$
>   V.zipWith3 (\x0 x1 x2 -> fromList [x0, x1, x2])
>   (V.tail flows) (V.tail $V.tail flows) (V.tail$ V.tail $V.tail flows)  > obsVariance :: Double > obsVariance = 1.0e-2  > bigSigmaY :: Sq 1 > bigSigmaY = fromList [ obsVariance ]  > predict :: R 3 -> Double -> Double -> Double -> Double > predict theta f1 f2 f3 = h1 * f1 + h2 * f2 + h3 * f3 > where > (h1, t1) = headTail theta > (h2, t2) = headTail t1 > (h3, _) = headTail t2  > thetas :: V.Vector Double -> [(R 3, Sq 3)] > thetas flows = outer muPrior sigmaPrior (bigHsBuilder flows) > bigSigmaY bigA bigSigmaX (map (fromList . return) (V.toList flows))  > predictions :: V.Vector Double -> V.Vector Double > predictions flows = > V.zipWith4 predict > (V.fromList$ map fst (thetas flows))
>   flows (V.tail flows) (V.tail $V.tail flows)  # How Good is Our Model? If we assume that parameters are essentially fixed by taking the state variance to be e.g. $10^{-8}$ then the fit is not good. However, if we assume the parameters to undergo Brownian motion by taking the state variance to be e.g. $10^{-2}$ then we get a much better fit. Of course, Brownian motion is probably not a good way of modelling the parameters; we hardly expect that these could wander off to infinity. # Rejection Sampling # Introduction Suppose you want to sample from the truncated normal distribution. One way to do this is to use rejection sampling. But if you do this naïvely then you will run into performance problems. The excellent Devroye (1986) who references Marsaglia (1964) gives an efficient rejection sampling scheme using the Rayleigh distribution. The random-fu package uses the Exponential distribution. # Performance > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-}  > {-# LANGUAGE FlexibleContexts #-}  > import Control.Monad > import Data.Random > import qualified Data.Random.Distribution.Normal as N  > import Data.Random.Source.PureMT > import Control.Monad.State  Here’s the naïve implementation. > naiveReject :: Double -> RVar Double > naiveReject x = doit > where > doit = do > y <- N.stdNormal > if y < x > then doit > else return y  And here’s an implementation using random-fu. > expReject :: Double -> RVar Double > expReject x = N.normalTail x  Let’s try running both of them > n :: Int > n = 10000000  > lower :: Double > lower = 2.0  > testExp :: [Double] > testExp = evalState (replicateM n$ sample (expReject lower)) (pureMT 3)

> testNaive :: [Double]
> testNaive = evalState (replicateM n $sample (naiveReject lower)) (pureMT 3)  > main :: IO () > main = do > print$ sum testExp
>   print sum testNaive  Let’s try building and running both the naïve and better tuned versions. ghc -O2 CompareRejects.hs As we can see below we get 59.98s and 4.28s, a compelling reason to use the tuned version. And the difference in performance will get worse the less of the tail we wish to sample from. ## Tuned 2.3731610476911187e7 11,934,195,432 bytes allocated in the heap 5,257,328 bytes copied during GC 44,312 bytes maximum residency (2 sample(s)) 21,224 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 23145 colls, 0 par 0.09s 0.11s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 4.19s ( 4.26s elapsed) GC time 0.09s ( 0.11s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.28s ( 4.37s elapsed) %GC time 2.2% (2.6% elapsed) Alloc rate 2,851,397,967 bytes per MUT second Productivity 97.8% of total user, 95.7% of total elapsed ## Naïve 2.3732073159369867e7 260,450,762,656 bytes allocated in the heap 111,891,960 bytes copied during GC 85,536 bytes maximum residency (2 sample(s)) 76,112 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 512768 colls, 0 par 1.86s 2.24s 0.0000s 0.0008s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 58.12s ( 58.99s elapsed) GC time 1.86s ( 2.24s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 59.98s ( 61.23s elapsed) %GC time 3.1% (3.7% elapsed) Alloc rate 4,481,408,869 bytes per MUT second Productivity 96.9% of total user, 94.9% of total elapsed # Bibliography Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag. http://books.google.co.uk/books?id=mEw\_AQAAIAAJ. Marsaglia, G. 1964. “Generating a Variable from the Tail of the Normal Distribution.” J-Technometrics 6 (1): 101–2. # Stochastic Volatility # Introduction Simple models for e.g. financial option pricing assume that the volatility of an index or a stock is constant, see here for example. However, simple observation of time series show that this is not the case; if it were then the log returns would be white noise One approach which addresses this, GARCH (Generalised AutoRegressive Conditional Heteroskedasticity), models the evolution of volatility deterministically. Stochastic volatility models treat the volatility of a return on an asset, such as an option to buy a security, as a Hidden Markov Model (HMM). Typically, the observable data consist of noisy mean-corrected returns on an underlying asset at equally spaced time points. There is evidence that Stochastic Volatility models (Kim, Shephard, and Chib (1998)) offer increased flexibility over the GARCH family, e.g. see Geweke (1994), Fridman and Harris (1998) and Jacquier, Polson, and Rossi (1994). Despite this and judging by the numbers of questions on the R Special Interest Group on Finance mailing list, the use of GARCH in practice far outweighs that of Stochastic Volatility. Reasons cited are the multiplicity of estimation methods for the latter and the lack of packages (but see here for a recent improvement to the paucity of packages). In their tutorial on particle filtering, Doucet and Johansen (2011) give an example of stochastic volatility. We save this approach for future blog posts and follow Lopes and Polson and the excellent lecture notes by Hedibert Lopes. Here’s the model. \displaystyle \begin{aligned} H_0 &\sim {\mathcal{N}}\left( m_0, C_0\right) \\ H_t &= \mu + \phi H_{t-1} + \tau \eta_t \\ Y_n &= \beta \exp(H_t / 2) \epsilon_n \\ \end{aligned} We wish to estimate $\mu, \phi, \tau$ and $\boldsymbol{h}$. To do this via a Gibbs sampler we need to sample from $\displaystyle {p \left( \mu, \phi, \tau \,\vert\, \boldsymbol{h}, \boldsymbol{y} \right)} \quad \text{and} \quad {p \left( \boldsymbol{h} \,\vert\, \mu, \phi, \tau, \boldsymbol{y} \right)}$ ## Haskell Preamble > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-}  > {-# LANGUAGE RecursiveDo #-} > {-# LANGUAGE ExplicitForAll #-} > {-# LANGUAGE TypeOperators #-} > {-# LANGUAGE TypeFamilies #-} > {-# LANGUAGE ScopedTypeVariables #-} > {-# LANGUAGE DataKinds #-} > {-# LANGUAGE FlexibleContexts #-}  > module StochVol ( > bigM > , bigM0 > , runMC > , ys > , vols > , expectationTau2 > , varianceTau2 > ) where  > import Numeric.LinearAlgebra.HMatrix hiding ( (===), (|||), Element, > (<>), (#>), inv ) > import qualified Numeric.LinearAlgebra.Static as S > import Numeric.LinearAlgebra.Static ( (<>) ) > import GHC.TypeLits > import Data.Proxy > import Data.Maybe ( fromJust )  > import Data.Random > import Data.Random.Source.PureMT > import Control.Monad.Fix > import Control.Monad.State.Lazy > import Control.Monad.Writer hiding ( (<>) ) > import Control.Monad.Loops > import Control.Applicative  > import qualified Data.Vector as V  > inv :: (KnownNat n, (1 <=? n) ~ 'True) => S.Sq n -> S.Sq n > inv m = fromJust S.linSolve m S.eye

> infixr 8 #>
> (#>) :: (KnownNat m, KnownNat n) => S.L m n -> S.R n -> S.R m
> (#>) = (S.#>)

> type StatsM a = RVarT (Writer [((Double, Double), Double)]) a

> (|||) :: (KnownNat ((+) r1 r2), KnownNat r2, KnownNat c, KnownNat r1) =>
>          S.L c r1 -> S.L c r2 -> S.L c ((+) r1 r2)
> (|||) = (S.¦)


# Marginal Distribution for Parameters

Let us take a prior that is standard for linear regression

$\displaystyle (\boldsymbol{\theta}, \tau^2) \sim {\mathcal{NIG}}(\boldsymbol{\theta}_0, V_0, \nu_0, s_0^2)$

where $\boldsymbol{\theta} = (\mu, \phi)^\top$ and use standard results for linear regression to obtain the required marginal distribution.

That the prior is Normal Inverse Gamma (${\cal{NIG}}$) means

\displaystyle \begin{aligned} \boldsymbol{\theta} \, | \, \tau^2 & \sim {\cal{N}}(\boldsymbol{\theta}_0, \tau^2 V_0) \\ \tau^2 & \sim {\cal{IG}}(\nu_0 / 2, \nu_0 s_0^2 / 2) \end{aligned}

Standard Bayesian analysis for regression tells us that the (conditional) posterior distribution for

$\displaystyle y_i = \beta + \alpha x_i + \epsilon_i$

where the $\{\epsilon_i\}$ are IID normal with variance $\sigma^2$ is given by

$\displaystyle {p \left( \alpha, \beta, \eta \,\vert\, \boldsymbol{y}, \boldsymbol{x} \right)} = {\cal{N}}((\alpha, \beta); \mu_n, \sigma^2\Lambda_n^{-1})\,{\cal{IG}}(a_n, b_n)$

with

$\displaystyle \Lambda_n = X_n^\top X_n + \Lambda_0$

$\displaystyle \begin{matrix} \mu_n = \Lambda_n^{-1}({X_n}^{\top}{X_n}\hat{\boldsymbol{\beta}}_n + \Lambda_0\mu_0) & \textrm{where} & \hat{\boldsymbol\beta}_n = ({X}_n^{\rm T}{X}_n)^{-1}{X}_n^{\rm T}\boldsymbol{y}_n \end{matrix}$

$\displaystyle \begin{matrix} a_n = \frac{n}{2} + a_0 & \quad & b_n = b_0 + \frac{1}{2}(\boldsymbol{y}^\top\boldsymbol{y} + \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n) \end{matrix}$

## Recursive Form

We can re-write the above recursively. We do not need to for this blog article but it will be required in any future blog article which uses Sequential Monte Carlo techniques.

$\displaystyle \Lambda_n = \boldsymbol{x}_n^\top \boldsymbol{x}_n + \Lambda_{n-1}$

Furthermore

$\displaystyle \Lambda_{n}\mu_{n} = {X}_{n}^{\rm T}\boldsymbol{y}_{n} + \Lambda_0\mu_0 = {X}_{n-1}^{\rm T}\boldsymbol{y}_{n-1} + \boldsymbol{x}_n^\top y_n + \Lambda_0\mu_0 = \Lambda_{n-1}\mu_{n-1} + \boldsymbol{x}_n^\top y_n$

so we can write

$\displaystyle \boldsymbol{\mu}_n = \Lambda_n^{-1}(\Lambda_{n-1}\mu_{n-1} + \boldsymbol{x}_n^\top y_n)$

and

$\displaystyle \begin{matrix} a_n = a_{n-1} + \frac{1}{2} & \quad & b_n = b_{n-1} + \frac{1}{2}\big[(y_n - \boldsymbol{\mu}_n^\top \boldsymbol{x}_n)y_n + (\boldsymbol{\mu}_{n-1} - \boldsymbol{\mu}_{n})^\top \Lambda_{n-1}\boldsymbol{\mu}_{n-1}\big] \end{matrix}$

## Specialising

In the case of our model we can specialise the non-recursive equations as

$\displaystyle \Lambda_n = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \ldots & \ldots \\ 1 & x_n \end{bmatrix} + \Lambda_0$

$\displaystyle \begin{matrix} \mu_n = (\Lambda_n)^{-1}({X_n}^{\top}{X_n}\hat{\boldsymbol{\beta}}_n + \Lambda_0\mu_0) & \textrm{where} & \hat{\boldsymbol\beta}_n = ({X}_n^{\rm T}{X}_n)^{-1}{X}_n^{\rm T}\boldsymbol{x}_{1:n} \end{matrix}$

$\displaystyle \begin{matrix} a_n = \frac{n}{2} + a_0 & \quad & b_n = b_0 + \frac{1}{2}(\boldsymbol{x}_{1:n}^\top\boldsymbol{x}_{1:n} + \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n) \end{matrix}$

Let’s re-write the notation to fit our model.

$\displaystyle \Lambda_n = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ h_1 & h_2 & \ldots & h_n \end{bmatrix} \begin{bmatrix} 1 & h_1 \\ 1 & h_2 \\ \ldots & \ldots \\ 1 & h_n \end{bmatrix} + \Lambda_0$

$\displaystyle \begin{matrix} \mu_n = (\Lambda_n)^{-1}({H_n}^{\top}{H_n}\hat{\boldsymbol{\theta}}_n + \Lambda_0\mu_0) & \textrm{where} & \hat{\boldsymbol\theta}_n = ({H}_n^{\rm T}{H}_n)^{-1}{H}_n^{\rm T}\boldsymbol{x}_{1:n} \end{matrix}$

$\displaystyle \begin{matrix} a_n = \frac{n}{2} + a_0 & \quad & b_n = b_0 + \frac{1}{2}(\boldsymbol{x}_{1:n}^\top\boldsymbol{x}_{1:n} + \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n) \end{matrix}$

Sample from ${p \left( \boldsymbol{\theta}, \tau^2 \,\vert\, \boldsymbol{h}, \boldsymbol{y} \right)} \sim {\mathcal{NIG}}(\boldsymbol{\theta}_1, V_1, \nu_1, s_1^2)$

We can implement this in Haskell as

> sampleParms ::
>   forall n m .
>   (KnownNat n, (1 <=? n) ~ 'True) =>
>   S.R n -> S.L n 2 -> S.R 2 -> S.Sq 2 -> Double -> Double ->
>   RVarT m (S.R 2, Double)
> sampleParms y bigX theta_0 bigLambda_0 a_0 s_02 = do
>   let n = natVal (Proxy :: Proxy n)
>       a_n = 0.5 * (a_0 + fromIntegral n)
>       bigLambda_n = bigLambda_0 + (tr bigX) <> bigX
>       invBigLambda_n = inv bigLambda_n
>       theta_n = invBigLambda_n #> ((tr bigX) #> y + (tr bigLambda_0) #> theta_0)
>       b_0 = 0.5 * a_0 * s_02
>       b_n = b_0 +
>             0.5 * (S.extract (S.row y <> S.col y)!0!0) +
>             0.5 * (S.extract (S.row theta_0 <> bigLambda_0 <> S.col theta_0)!0!0) -
>             0.5 * (S.extract (S.row theta_n <> bigLambda_n <> S.col theta_n)!0!0)
>   g <- rvarT (Gamma a_n (recip b_n))
>   let s2 = recip g
>       invBigLambda_n' = m <> invBigLambda_n
>         where
>           m = S.diag S.vector (replicate 2 s2) > m1 <- rvarT StdNormal > m2 <- rvarT StdNormal > let theta_n' :: S.R 2 > theta_n' = theta_n + S.chol (S.sym invBigLambda_n') #> (S.vector [m1, m2]) > return (theta_n', s2)  # Marginal Distribution for State ## Marginal for $H_0$ Using a standard result about conjugate priors and since we have $\displaystyle h_0 \sim {\cal{N}}(m_0,C_0) \quad h_1 \vert h_0 \sim {\cal{N}}(\mu + \phi h_0, \tau^2)$ we can deduce $\displaystyle h_0 \vert h_1 \sim {\cal{N}}(m_1,C_1)$ where \displaystyle \begin{aligned} \frac{1}{C_1} &= \frac{1}{C_0} + \frac{\phi^2}{\tau^2} \\ \frac{m_1}{C_1} &= \frac{m_0}{C_0} + \frac{\phi(h_1 - \mu)}{\tau^2} \end{aligned} > sampleH0 :: Double -> > Double -> > V.Vector Double -> > Double -> > Double -> > Double -> > RVarT m Double > sampleH0 iC0 iC0m0 hs mu phi tau2 = do > let var = recip (iC0 + phi^2 / tau2)
>       mean = var * (iC0m0 + phi * ((hs V.! 0) - mu) / tau2)
>   rvarT (Normal mean (sqrt var))


## Marginal for $H_1 \ldots H_n$

From the state equation, we have

\displaystyle \begin{aligned} H_{t+1} &= \mu + \phi H_{t} + \tau \eta_t \\ \phi^2 H_{t} &= -\phi\mu + \phi H_{t+1} - \phi \tau \eta_t \\ \end{aligned}

We also have

\displaystyle \begin{aligned} H_{t} &= \mu + \phi H_{t-1} + \tau \eta_{t-1} \\ \end{aligned}

Adding the two expressions together gives

\displaystyle \begin{aligned} (1 + \phi^2)H_{t} &= \phi (H_{t-1} + H_{t+1}) + \mu (1 - \phi) + \tau(\eta_{t-1} - \phi\eta_t) \\ H_{t} &= \frac{\phi}{1 + \phi^2} (H_{t-1} + H_{t+1}) + \mu \frac{1 - \phi}{1 + \phi^2} + \frac{\tau}{1 + \phi^2}(\eta_{t-1} - \phi\eta_t) \\ \end{aligned}

Since $\{\eta_t\}$ are standard normal, then $H_t$ conditional on $H_{t-1}$ and $H_{t+1}$ is normally distributed, and

\displaystyle \begin{aligned} \mathbb{E}(H_n\mid H_{n-1}, H_{n+1}) &= \frac{1 - \phi}{1+\phi^2}\mu + \frac{\phi}{1+\phi^2}(H_{n-1} + H_{n+1}) \\ \mathbb{V}(H_n\mid H_{n-1}, H_{n+1}) &= \frac{\tau^2}{1+\phi^2} \end{aligned}

We also have

$\displaystyle h_{n+1} \vert h_n \sim {\cal{N}}(\mu + \phi h_n, \tau^2)$

Writing

$\displaystyle \boldsymbol{h}_{-t} = \begin{bmatrix} h_0, & h_1, & \ldots, & h_{t-1}, & h_{t+1}, & \ldots, & h_{n-1}, & h_n \end{bmatrix}$

by Bayes’ Theorem we have

$\displaystyle {p \left( h_t \,\vert\, \boldsymbol{h}_{-t}, \theta, \boldsymbol{y} \right)} \propto {p \left( y_t \,\vert\, h_t \right)} {p \left( h_t \,\vert\, \boldsymbol{h}_{-t}, \theta, y_{-t} \right)} = f_{\cal{N}}(y_t;0,e^{h_t}) f_{\cal{N}}(h_t;\mu_t,\nu_t^2)$

where $f_{\cal{N}}(x;\mu,\sigma^2)$ is the probability density function of a normal distribution.

We can sample from this using Metropolis

1. For each $t$, sample $h_t^\flat$ from ${\cal{N}}(h_t, \gamma^2)$ where $\gamma^2$ is the tuning variance.

2. For each $t=1, \ldots, n$, compute the acceptance probability

$\displaystyle p_t = \min{\Bigg(\frac{f_{\cal{N}}(h^\flat_t;\mu_t,\nu_t^2) f_{\cal{N}}(y_t;0,e^{h^\flat_t})}{f_{\cal{N}}(h_t;\mu_t,\nu_t^2) f_{\cal{N}}(y_t;0,e^{h_t})}, 1 \Bigg)}$

1. For each $t$, compute a new value of $h_t$

$\displaystyle h^\sharp_t = \begin{cases} h^\flat_t \text{with probability } p_t \\ h_t \text{with probability } 1 - p_t \end{cases}$

> metropolis :: V.Vector Double ->
>               Double ->
>               Double ->
>               Double ->
>               Double ->
>               V.Vector Double ->
>               Double ->
>               RVarT m (V.Vector Double)
> metropolis ys mu phi tau2 h0 hs vh = do
>   let eta2s = V.replicate (n-1) (tau2 / (1 + phi^2)) V.snoc tau2
>       etas  = V.map sqrt eta2s
>       coef1 = (1 - phi) / (1 + phi^2) * mu
>       coef2 = phi / (1 + phi^2)
>       mu_n  = mu + phi * (hs V.! (n-1))
>       mu_1  = coef1 + coef2 * ((hs V.! 1) + h0)
>       innerMus = V.zipWith (\hp1 hm1 -> coef1 + coef2 * (hp1 + hm1)) (V.tail (V.tail hs)) hs
>       mus = mu_1 V.cons innerMus V.snoc mu_n
>   hs' <- V.mapM (\mu -> rvarT (Normal mu vh)) hs
>   let num1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs'
>       num2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs'
>       nums  = V.zipWith (+) num1s num2s
>       den1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs
>       den2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs
>       dens = V.zipWith (+) den1s den2s
>   us <- V.replicate n <$> rvarT StdUniform > let ls = V.zipWith (\n d -> min 0.0 (n - d)) nums dens > return$ V.zipWith4 (\u l h h' -> if log u < l then h' else h) us ls hs hs'


# Markov Chain Monte Carlo

Now we can write down a single step for our Gibbs sampler, sampling from each marginal in turn.

> singleStep :: Double -> V.Vector Double ->
>               (Double, Double, Double, Double, V.Vector Double) ->
>               StatsM (Double, Double, Double, Double, V.Vector Double)
> singleStep vh y (mu, phi, tau2, h0, h) = do
>   lift $tell [((mu, phi),tau2)] > hNew <- metropolis y mu phi tau2 h0 h vh > h0New <- sampleH0 iC0 iC0m0 hNew mu phi tau2 > let bigX' = (S.col$ S.vector $replicate n 1.0) > ||| > (S.col$ S.vector $V.toList$ h0New V.cons V.init hNew)
>       bigX =  bigX' asTypeOf (snd $valAndType nT) > newParms <- sampleParms (S.vector$ V.toList h) bigX (S.vector [mu0, phi0]) invBigV0 nu0 s02
>   return ( (S.extract (fst newParms))!0
>          , (S.extract (fst newParms))!1
>          , snd newParms
>          , h0New
>          , hNew
>          )


# Testing

Let’s create some test data.

> mu', phi', tau2', tau' :: Double
> mu'   = -0.00645
> phi'  =  0.99
> tau2' =  0.15^2
> tau'  = sqrt tau2'


We need to create a statically typed matrix with one dimension the same size as the data so we tie the data size value to the required type.

> nT :: Proxy 500
> nT = Proxy

> valAndType :: KnownNat n => Proxy n -> (Int, S.L n 2)
> valAndType x = (fromIntegral $natVal x, undefined)  > n :: Int > n = fst$ valAndType nT


Arbitrarily let us start the process at

> h0 :: Double
> h0 = 0.0


We define the process as a stream (aka co-recursively) using the Haskell recursive do construct. It is not necessary to do this but streams are a natural way to think of stochastic processes.

> hs, vols, sds, ys :: V.Vector Double
> hs = V.fromList $take n$ fst $runState hsAux (pureMT 1) > where > hsAux :: (MonadFix m, MonadRandom m) => m [Double] > hsAux = mdo { x0 <- sample (Normal (mu' + phi' * h0) tau') > ; xs <- mapM (\x -> sample (Normal (mu' + phi' * x) tau')) (x0:xs) > ; return xs > }  > vols = V.map exp hs  We can plot the volatility (which we cannot observe directly). And we can plot the log returns. > sds = V.map sqrt vols  > ys = fst$ runState ysAux (pureMT 2)
>   where
>     ysAux = V.mapM (\sd -> sample (Normal 0.0 sd)) sds


We start with a vague prior for $H_0$

> m0, c0 :: Double
> m0 = 0.0
> c0 = 100.0


For convenience

> iC0, iC0m0 :: Double
> iC0 = recip c0
> iC0m0  = iC0 * m0


Rather than really sample from priors for $\mu, \phi$ and $\tau^2$ let us cheat and assume we sampled the simulated values!

> mu0, phi0, tau20 :: Double
> mu0   = -0.00645
> phi0  =  0.99
> tau20 =  0.15^2


But that we are still very uncertain about them

> bigV0, invBigV0 :: S.Sq 2
> bigV0 = S.diag $S.fromList [100.0, 100.0] > invBigV0 = inv bigV0  > nu0, s02 :: Double > nu0 = 10.0 > s02 = (nu0 - 2) / nu0 * tau20  Note that for the inverse gamma this gives > expectationTau2, varianceTau2 :: Double > expectationTau2 = (nu0 * s02 / 2) / ((nu0 / 2) - 1) > varianceTau2 = (nu0 * s02 / 2)^2 / (((nu0 / 2) - 1)^2 * ((nu0 / 2) - 2))  ghci> expectationTau2 2.25e-2 ghci> varianceTau2 1.6874999999999998e-4  ## Running the Markov Chain Tuning parameter > vh :: Double > vh = 0.1  The burn-in and sample sizes may be too low for actual estimation but will suffice for a demonstration. > bigM, bigM0 :: Int > bigM0 = 2000 > bigM = 2000  > multiStep :: StatsM (Double, Double, Double, Double, V.Vector Double) > multiStep = iterateM_ (singleStep vh ys) (mu0, phi0, tau20, h0, vols)  > runMC :: [((Double, Double), Double)] > runMC = take bigM$ drop bigM0 > execWriter (evalStateT (sample multiStep) (pureMT 42))  And now we can look at the distributions of our estimates # Bibliography Doucet, Arnaud, and Adam M Johansen. 2011. “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.” In Handbook of Nonlinear Filtering. Oxford, UK: Oxford University Press. Fridman, Moshe, and Lawrence Harris. 1998. “A Maximum Likelihood Approach for Non-Gaussian Stochastic Volatility Models.” Journal of Business & Economic Statistics 16 (3): 284–91. Geweke, John. 1994. “Bayesian Comparison of Econometric Models.” Jacquier, Eric, Nicholas G. Polson, and Peter E. Rossi. 1994. “Bayesian Analysis of Stochastic Volatility Models.” Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65 (3): 361–93. http://ideas.repec.org/a/bla/restud/v65y1998i3p361-93.html. # Fun with (Extended Kalman) Filters # Summary An extended Kalman filter in Haskell using type level literals and automatic differentiation to provide some guarantees of correctness. # Population Growth Suppose we wish to model population growth of bees via the logistic equation \displaystyle \begin{aligned} \dot{p} & = rp\Big(1 - \frac{p}{k}\Big) \end{aligned} We assume the growth rate $r$ is unknown and drawn from a normal distribution ${\cal{N}}(\mu_r, \sigma_r^2)$ but the carrying capacity $k$ is known and we wish to estimate the growth rate by observing noisy values $y_i$ of the population at discrete times $t_0 = 0, t_1 = \Delta T, t_2 = 2\Delta T, \ldots$. Note that $p_t$ is entirely deterministic and its stochasticity is only as a result of the fact that the unknown parameter of the logistic equation is sampled from a normal distribution (we could for example be observing different colonies of bees and we know from the literature that bee populations obey the logistic equation and each colony will have different growth rates). ## Haskell Preamble > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-}  > {-# LANGUAGE DataKinds #-} > {-# LANGUAGE ScopedTypeVariables #-} > {-# LANGUAGE RankNTypes #-} > {-# LANGUAGE BangPatterns #-} > {-# LANGUAGE TypeOperators #-} > {-# LANGUAGE TypeFamilies #-}  > module FunWithKalman3 where  > import GHC.TypeLits > import Numeric.LinearAlgebra.Static > import Data.Maybe ( fromJust )  > import Numeric.AD  > import Data.Random.Source.PureMT > import Data.Random  > import Control.Monad.State > import qualified Control.Monad.Writer as W > import Control.Monad.Loops  ## Logistic Equation The logistic equation is a well known example of a dynamical system which has an analytic solution $\displaystyle p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}$ Here it is in Haskell > logit :: Floating a => a -> a -> a -> a > logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))  We observe a noisy value of population at regular time intervals (where $\Delta T$ is the time interval) \displaystyle \begin{aligned} p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned} Using the semi-group property of our dynamical system, we can re-write this as \displaystyle \begin{aligned} p_i &= \frac{kp_{i-1}\exp r\Delta T}{k + p_{i-1}(\exp r\Delta T - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned} To convince yourself that this re-formulation is correct, think of the population as starting at $p_0$; after 1 time step it has reached $p_1$ and and after two time steps it has reached $p_2$ and this ought to be the same as the point reached after 1 time step starting at $p_1$, for example > oneStepFrom0, twoStepsFrom0, oneStepFrom1 :: Double > oneStepFrom0 = logit 0.1 1.0 (1 * 0.1) > twoStepsFrom0 = logit 0.1 1.0 (1 * 0.2) > oneStepFrom1 = logit oneStepFrom0 1.0 (1 * 0.1)  ghci> twoStepsFrom0 0.11949463171139338 ghci> oneStepFrom1 0.1194946317113934  We would like to infer the growth rate not just be able to predict the population so we need to add another variable to our model. \displaystyle \begin{aligned} r_i &= r_{i-1} \\ p_i &= \frac{kp_{i-1}\exp r_{i-1}\Delta T}{k + p_{i-1}(\exp r_{i-1}\Delta T - 1)} \\ y_i &= \begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}r_i \\ p_i\end{bmatrix} + \begin{bmatrix}0 \\ \epsilon_i\end{bmatrix} \end{aligned} # Extended Kalman This is almost in the form suitable for estimation using a Kalman filter but the dependency of the state on the previous state is non-linear. We can modify the Kalman filter to create the extended Kalman filter (EKF) by making a linear approximation. Since the measurement update is trivially linear (even in this more general form), the measurement update step remains unchanged. \displaystyle \begin{aligned} \boldsymbol{v}_i & \triangleq \boldsymbol{y}_i - \boldsymbol{g}(\hat{\boldsymbol{x}}^\flat_i) \\ \boldsymbol{S}_i & \triangleq \boldsymbol{G}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{G}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\ \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{G}_i^\top\boldsymbol{S}^{-1}_i \\ \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\ \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i \end{aligned} By Taylor we have $\displaystyle \boldsymbol{a}(\boldsymbol{x}) \approx \boldsymbol{a}(\boldsymbol{m}) + \boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m})\delta\boldsymbol{x}$ where $\boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m})$ is the Jacobian of $\boldsymbol{a}$ evaluated at $\boldsymbol{m}$ (for the exposition of the extended filter we take $\boldsymbol{a}$ to be vector valued hence the use of a bold font). We take $\delta\boldsymbol{x}$ to be normally distributed with mean of 0 and ignore any difficulties there may be with using Taylor with stochastic variables. Applying this at $\boldsymbol{m} = \hat{\boldsymbol{x}}_{i-1}$ we have $\displaystyle \boldsymbol{x}_i = \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1})(\boldsymbol{x}_{i-1} - \hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{\epsilon}_i$ Using the same reasoning as we did as for Kalman filters and writing $\boldsymbol{A}_{i-1}$ for $\boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1})$ we obtain \displaystyle \begin{aligned} \hat{\boldsymbol{x}}^\flat_i &= \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) \\ \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1} \hat{\boldsymbol{\Sigma}}_{i-1} \boldsymbol{A}_{i-1}^\top + \boldsymbol{\Sigma}^{(x)}_{i-1} \end{aligned} ## Haskell Implementation Note that we pass in the Jacobian of the update function as a function itself in the case of the extended Kalman filter rather than the matrix representing the linear function as we do in the case of the classical Kalman filter. > k, p0 :: Floating a => a > k = 1.0 > p0 = 0.1 * k  > r, deltaT :: Floating a => a > r = 10.0 > deltaT = 0.0005  Relating ad and hmatrix is somewhat unpleasant but this can probably be ameliorated by defining a suitable datatype. > a :: R 2 -> R 2 > a rpPrev = rNew # pNew > where > (r, pPrev) = headTail rpPrev > rNew :: R 1 > rNew = konst r > > (p, _) = headTail pPrev > pNew :: R 1 > pNew = fromList [logit p k (r * deltaT)]

> bigA :: R 2 -> Sq 2
> bigA rp = fromList $concat$ j [r, p]
>   where
>     (r, ps) = headTail rp
>     (p,  _) = headTail ps
>     j = jacobian (\[r, p] -> [r, logit p k (r * deltaT)])


For some reason, hmatrix with static guarantees does not yet provide an inverse function for matrices.

> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $linSolve m eye  Here is the extended Kalman filter itself. The type signatures on the expressions inside the function are not necessary but did help the implementor discover a bug in the mathematical derivation and will hopefully help the reader. > outer :: forall m n . (KnownNat m, KnownNat n, > (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) => > R n -> Sq n -> > L m n -> Sq m -> > (R n -> R n) -> (R n -> Sq n) -> Sq n -> > [R m] -> > [(R n, Sq n)] > outer muPrior sigmaPrior bigH bigSigmaY > littleA bigABuilder bigSigmaX ys = result > where > result = scanl update (muPrior, sigmaPrior) ys > > update :: (R n, Sq n) -> R m -> (R n, Sq n) > update (xHatFlat, bigSigmaHatFlat) y = > (xHatFlatNew, bigSigmaHatFlatNew) > where > > v :: R m > v = y - (bigH #> xHatFlat) > > bigS :: Sq m > bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY > > bigK :: L n m > bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS) > > xHat :: R n > xHat = xHatFlat + bigK #> v > > bigSigmaHat :: Sq n > bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK) > > bigA :: Sq n > bigA = bigABuilder xHat > > xHatFlatNew :: R n > xHatFlatNew = littleA xHat > > bigSigmaHatFlatNew :: Sq n > bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX  Now let us create some sample data. > obsVariance :: Double > obsVariance = 1e-2  > bigSigmaY :: Sq 1 > bigSigmaY = fromList [obsVariance]  > nObs :: Int > nObs = 300  > singleSample :: Double -> RVarT (W.Writer [Double]) Double > singleSample p0 = do > epsilon <- rvarT (Normal 0.0 obsVariance) > let p1 = logit p0 k (r * deltaT) > lift$ W.tell [p1 + epsilon]
>   return p1

> streamSample :: RVarT (W.Writer [Double]) Double
> streamSample = iterateM_ singleSample p0

> samples :: [Double]
> samples = take nObs $snd$
>           W.runWriter (evalStateT (sample streamSample) (pureMT 3))


We created our data with a growth rate of

ghci> r
10.0


but let us pretend that we have read the literature on growth rates of bee colonies and we have some big doubts about growth rates but we are almost certain about the size of the colony at $t=0$.

> muPrior :: R 2
> muPrior = fromList [5.0, 0.1]
>
> sigmaPrior :: Sq 2
> sigmaPrior = fromList [ 1e2, 0.0
>                       , 0.0, 1e-10
>                       ]


We only observe the population and not the rate itself.

> bigH :: L 1 2
> bigH = fromList [0.0, 1.0]


Strictly speaking this should be 0 but this is close enough.

> bigSigmaX :: Sq 2
> bigSigmaX = fromList [ 1e-10, 0.0
>                      , 0.0, 1e-10
>                      ]


Now we can run our filter and watch it switch away from our prior belief as it accumulates more and more evidence.

> test :: [(R 2, Sq 2)]
> test = outer muPrior sigmaPrior bigH bigSigmaY
>        a bigA bigSigmaX (map (fromList . return) samples)


# Introduction

Suppose we have a vector of weights which sum to 1.0 and we wish to sample n samples randomly according to these weights. There is a well known trick in Matlab / Octave using sampling from a uniform distribution.

num_particles = 2*10^7
likelihood = zeros(num_particles,1);
likelihood(:,1) = 1/num_particles;
[_,index] = histc(rand(num_particles,1),[0;cumsum(likelihood/sum(likelihood))]);
s = sum(index);

Using tic and toc this produces an answer with

Elapsed time is 10.7763 seconds.

I could find no equivalent function in Haskell nor could I easily find a binary search function.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}

> {-# LANGUAGE BangPatterns                 #-}

> import System.Random.MWC
> import qualified Data.Vector.Unboxed as V

> import qualified Data.Vector.Algorithms.Search as S

> import Data.Bits

> n :: Int
> n = 2*10^7


Let’s create some random data. For a change let’s use mwc-random rather than random-fu.

> vs  :: V.Vector Double
> vs = runST (create >>= (asGenST $\gen -> uniformVector gen n))  Again, I could find no equivalent of cumsum but we can write our own. > weightsV, cumSumWeightsV :: V.Vector Double > weightsV = V.replicate n (recip$ fromIntegral n)
> cumSumWeightsV = V.scanl (+) 0 weightsV


Binary search on a sorted vector is straightforward and a cumulative sum ensures that the vector is sorted.

> binarySearch :: (V.Unbox a, Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) shiftR 1

> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs


To see how well this performs, let’s sum the indices (of course, we wouldn’t do this in practice) as we did for the Matlab implementation.

> js :: V.Vector Int
> js = indices (V.tail cumSumWeightsV) vs

> main :: IO ()
> main = do
>   print $V.foldl' (+) 0 js  Using +RTS -s we get Total time 10.80s ( 11.06s elapsed) which is almost the same as the Matlab version. I did eventually find a binary search function in vector-algorithms and since one should not re-invent the wheel, let us try using it. > indices' :: (V.Unbox a, Ord a) => V.Vector a -> V.Vector a -> V.Vector Int > indices' sv x = runST$ do
>   st <- V.unsafeThaw (V.tail sv)
>   V.mapM (S.binarySearch st) x

> main' :: IO ()
> main' = do
>   print $V.foldl' (+) 0$ indices' cumSumWeightsV vs


Again using +RTS -s we get

Total   time   11.34s  ( 11.73s elapsed)

So the library version seems very slightly slower.

# Importance Sampling

Suppose we have an random variable $X$ with pdf $1/2\exp{-\lvert x\rvert}$ and we wish to find its second moment numerically. However, the random-fu package does not support sampling from such as distribution. We notice that

$\displaystyle \int_{-\infty}^\infty x^2 \frac{1}{2} \exp{-\lvert x\rvert} \mathrm{d}x = \int_{-\infty}^\infty x^2 \frac{\frac{1}{2} \exp{-\lvert x\rvert}} {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}} \frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}} \,\mathrm{d}x$

So we can sample from ${\cal{N}}(0, 4)$ and evaluate

$\displaystyle x^2 \frac{\frac{1}{2} \exp{-\lvert x\rvert}} {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}}$

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}

> module Importance where

> import Control.Monad
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Random.Distribution.Binomial
> import Data.Random.Distribution.Beta
> import qualified Control.Monad.Writer as W

> sampleImportance :: RVarT (W.Writer [Double]) ()
> sampleImportance = do
>   x <- rvarT $Normal 0.0 2.0 > let x2 = x^2 > u = x2 * 0.5 * exp (-(abs x)) > v = (exp ((-x2)/8)) * (recip (sqrt (8*pi))) > w = u / v > lift$ W.tell [w]
>   return ()

> runImportance :: Int -> [Double]
> runImportance n =
>   snd $> W.runWriter$
>   evalStateT (sample (replicateM n sampleImportance))
>              (pureMT 2)


We can run this 10,000 times to get an estimate.

ghci> import Formatting
ghci> format (fixed 2) (sum (runImportance 10000) / 10000)
"2.03"


Since we know that the $n$-th moment of the exponential distribution is $n! / \lambda^n$ where $\lambda$ is the rate (1 in this example), the exact answer is 2 which is not too far from our estimate using importance sampling.

The value of

$\displaystyle w(x) = \frac{1}{N}\frac{\frac{1}{2} \exp{-\lvert x\rvert}} {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}} = \frac{p(x)}{\pi(x)}$

is called the weight, $p$ is the pdf from which we wish to sample and $\pi$ is the pdf of the importance distribution.

# Importance Sampling Approximation of the Posterior

Suppose that the posterior distribution of a model in which we are interested has a complicated functional form and that we therefore wish to approximate it in some way. First assume that we wish to calculate the expectation of some arbitrary function $f$ of the parameters.

$\displaystyle {\mathbb{E}}(f({x}) \,\vert\, y_1, \ldots y_T) = \int_\Omega f({x}) p({x} \, \vert \, y_1, \ldots y_T) \,\mathrm{d}{x}$

Using Bayes

$\displaystyle \int_\Omega f({x}) {p\left(x \,\vert\, y_1, \ldots y_T\right)} \,\mathrm{d}{x} = \frac{1}{Z}\int_\Omega f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x) \,\mathrm{d}{x}$

where $Z$ is some normalizing constant.

As before we can re-write this using a proposal distribution $\pi(x)$

$\displaystyle \frac{1}{Z}\int_\Omega f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x) \,\mathrm{d}{x} = \frac{1}{Z}\int_\Omega \frac{f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x)}{\pi(x)}\pi(x) \,\mathrm{d}{x}$

We can now sample $X^{(i)} \sim \pi({x})$ repeatedly to obtain

$\displaystyle {\mathbb{E}}(f({x}) \,\vert\, y_1, \ldots y_T) \approx \frac{1}{ZN}\sum_1^N f({X^{(i)}}) \frac{p(y_1, \ldots y_T \, \vert \, {X^{(i)}})p({X^{(i)}})} {\pi({X^{(i)}})} = \sum_1^N w_if({X^{(i)}})$

where the weights $w_i$ are defined as before by

$\displaystyle w_i = \frac{1}{ZN} \frac{p(y_1, \ldots y_T \, \vert \, {X^{(i)}})p({X^{(i)}})} {\pi({X^{(i)}})}$

We follow Alex Cook and use the example from (Rerks-Ngarm et al. 2009). We take the prior as $\sim {\cal{Be}}(1,1)$ and use ${\cal{U}}(0.0,1.0)$ as the proposal distribution. In this case the proposal and the prior are identical just expressed differently and therefore cancel.

Note that we use the log of the pdf in our calculations otherwise we suffer from (silent) underflow, e.g.,

ghci> pdf (Binomial nv (0.4 :: Double)) xv
0.0


On the other hand if we use the log pdf form

ghci> logPdf (Binomial nv (0.4 :: Double)) xv
-3900.8941170876574

> xv, nv :: Int
> xv = 51
> nv = 8197

> sampleUniform :: RVarT (W.Writer [Double]) ()
> sampleUniform = do
>   x <- rvarT StdUniform
>   lift $W.tell [x] > return ()  > runSampler :: RVarT (W.Writer [Double]) () -> > Int -> Int -> [Double] > runSampler sampler seed n = > snd$
>   W.runWriter $> evalStateT (sample (replicateM n sampler)) > (pureMT (fromIntegral seed))  > sampleSize :: Int > sampleSize = 1000  > pv :: [Double] > pv = runSampler sampleUniform 2 sampleSize  > logWeightsRaw :: [Double] > logWeightsRaw = map (\p -> logPdf (Beta 1.0 1.0) p + > logPdf (Binomial nv p) xv - > logPdf StdUniform p) pv  > logWeightsMax :: Double > logWeightsMax = maximum logWeightsRaw > > weightsRaw :: [Double] > weightsRaw = map (\w -> exp (w - logWeightsMax)) logWeightsRaw  > weightsSum :: Double > weightsSum = sum weightsRaw  > weights :: [Double] > weights = map (/ weightsSum) weightsRaw  > meanPv :: Double > meanPv = sum$ zipWith (*) pv weights
>
> meanPv2 :: Double
> meanPv2 = sum $zipWith (\p w -> p * p * w) pv weights > > varPv :: Double > varPv = meanPv2 - meanPv * meanPv  We get the answer ghci> meanPv 6.400869727227364e-3  But if we look at the size of the weights and the effective sample size ghci> length$ filter (>= 1e-6) weights
9

ghci> (sum weights)^2 / (sum $map (^2) weights) 4.581078458313967  so we may not be getting a very good estimate. Let’s try > sampleNormal :: RVarT (W.Writer [Double]) () > sampleNormal = do > x <- rvarT$ Normal meanPv (sqrt varPv)
>   lift $W.tell [x] > return ()  > pvC :: [Double] > pvC = runSampler sampleNormal 3 sampleSize  > logWeightsRawC :: [Double] > logWeightsRawC = map (\p -> logPdf (Beta 1.0 1.0) p + > logPdf (Binomial nv p) xv - > logPdf (Normal meanPv (sqrt varPv)) p) pvC  > logWeightsMaxC :: Double > logWeightsMaxC = maximum logWeightsRawC > > weightsRawC :: [Double] > weightsRawC = map (\w -> exp (w - logWeightsMaxC)) logWeightsRawC  > weightsSumC :: Double > weightsSumC = sum weightsRawC  > weightsC :: [Double] > weightsC = map (/ weightsSumC) weightsRawC  > meanPvC :: Double > meanPvC = sum$ zipWith (*) pvC weightsC

> meanPvC2 :: Double
> meanPvC2 = sum $zipWith (\p w -> p * p * w) pvC weightsC > > varPvC :: Double > varPvC = meanPvC2 - meanPvC * meanPvC  Now the weights and the effective size are more re-assuring ghci> length$ filter (>= 1e-6) weightsC
1000

ghci> (sum weightsC)^2 / (sum \$ map (^2) weightsC)
967.113872888872


And we can take more confidence in the estimate

ghci> meanPvC
6.371225269833208e-3


# Bibliography

Rerks-Ngarm, Supachai, Punnee Pitisuttithum, Sorachai Nitayaphan, Jaranit Kaewkungwal, Joseph Chiu, Robert Paris, Nakorn Premsri, et al. 2009. “Vaccination with ALVAC and AIDSVAX to Prevent HIV-1 Infection in Thailand.” New England Journal of Medicine 361 (23) (December 3): 2209–2220. doi:10.1056/nejmoa0908492. http://dx.doi.org/10.1056/nejmoa0908492.