# Introduction

The equation of motion for a pendulum of unit length subject to Gaussian white noise is

$\displaystyle \frac{\mathrm{d}^2\alpha}{\mathrm{d}t^2} = -g\sin\alpha + w(t)$

We can discretize this via the usual Euler method

$\displaystyle \begin{bmatrix} x_{1,i} \\ x_{2,i} \end{bmatrix} = \begin{bmatrix} x_{1,i-1} + x_{2,i-1}\Delta t \\ x_{2,i-1} - g\sin x_{1,i-1}\Delta t \end{bmatrix} + \mathbf{q}_{i-1}$

where $q_i \sim {\mathcal{N}}(0,Q)$ and

$\displaystyle Q = \begin{bmatrix} \frac{q^c \Delta t^3}{3} & \frac{q^c \Delta t^2}{2} \\ \frac{q^c \Delta t^2}{2} & {q^c \Delta t} \end{bmatrix}$

The explanation of the precise form of the covariance matrix will be the subject of another blog post; for the purpose of exposition of using Stan and, in particular, Stan’s ability to handle ODEs, this detail is not important.

Instead of assuming that we know $g$ let us take it to be unknown and that we wish to infer its value using the pendulum as our experimental apparatus.

Stan is a probabilistic programming language which should be welll suited to perform such an inference. We use its interface via the R package rstan.

# A Stan and R Implementation

Let’s generate some samples using Stan but rather than generating exactly the model we have given above, instead let’s solve the differential equation and then add some noise. Of course this won’t quite give us samples from the model the parameters of which we wish to estimate but it will allow us to use Stan’s ODE solving capability.

Here’s the Stan

functions {
real[] pendulum(real t,
real[] y,
real[] theta,
real[] x_r,
int[] x_i) {
real dydt[2];
dydt[1] <- y[2];
dydt[2] <- - theta[1] * sin(y[1]);
return dydt;
}
}
data {
int<lower=1> T;
real y0[2];
real t0;
real ts[T];
real theta[1];
real sigma[2];
}
transformed data {
real x_r[0];
int x_i[0];
}
model {
}
generated quantities {
real y_hat[T,2];
y_hat <- integrate_ode(pendulum, y0, t0, ts, theta, x_r, x_i);
for (t in 1:T) {
y_hat[t,1] <- y_hat[t,1] + normal_rng(0,sigma[1]);
y_hat[t,2] <- y_hat[t,2] + normal_rng(0,sigma[2]);
}
}

And here’s the R to invoke it

library(rstan)
library(mvtnorm)

qc1 = 0.0001
deltaT = 0.01
nSamples = 100
m0 = c(1.6, 0)
g = 9.81
t0 = 0.0
ts = seq(deltaT,nSamples * deltaT,deltaT)

bigQ = matrix(c(qc1 * deltaT^3 / 3, qc1 * deltaT^2 / 2,
qc1 * deltaT^2 / 2,       qc1 * deltaT
),
nrow = 2,
ncol = 2,
byrow = TRUE
)

samples <- stan(file = 'Pendulum.stan',
data = list (
T  = nSamples,
y0 = m0,
t0 = t0,
ts = ts,
theta = array(g, dim = 1),
sigma = c(bigQ[1,1], bigQ[2,2]),
refresh = -1
),
algorithm="Fixed_param",
seed = 42,
chains = 1,
iter =1
)

We can plot the angle the pendulum subtends to the vertical over time. Note that this is not very noisy.

s <- extract(samples,permuted=FALSE)
plot(s[1,1,1:100])

Now let us suppose that we don’t know the value of $g$ and we can only observe the horizontal displacement.

zStan <- sin(s[1,1,1:nSamples])

Now we can use Stan to infer the value of $g$.

functions {
real[] pendulum(real t,
real[] y,
real[] theta,
real[] x_r,
int[] x_i) {
real dydt[2];
dydt[1] <- y[2];
dydt[2] <- - theta[1] * sin(y[1]);
return dydt;
}
}
data {
int<lower=1> T;
real y0[2];
real z[T];
real t0;
real ts[T];
}
transformed data {
real x_r[0];
int x_i[0];
}
parameters {
real theta[1];
vector<lower=0>[1] sigma;
}
model {
real y_hat[T,2];
real z_hat[T];
theta ~ normal(0,1);
sigma ~ cauchy(0,2.5);
y_hat <- integrate_ode(pendulum, y0, t0, ts, theta, x_r, x_i);
for (t in 1:T) {
z_hat[t] <- sin(y_hat[t,1]);
z[t] ~ normal(z_hat[t], sigma);
}
}

Here’s the R to invoke it.

estimates <- stan(file = 'PendulumInfer.stan',
data = list (
T  = nSamples,
y0 = m0,
z  = zStan,
t0 = t0,
ts = ts
),
seed = 42,
chains = 1,
iter = 1000,
warmup = 500,
refresh = -1
)
e <- extract(estimates,pars=c("theta[1]","sigma[1]","lp__"),permuted=TRUE)

This gives an estiamted valeu for $g$ of 9.809999 which is what we would hope.

Now let’s try adding some noise to our observations.

set.seed(42)
epsilons <- rmvnorm(n=nSamples,mean=c(0.0),sigma=bigR)

zStanNoisy <- sin(s[1,1,1:nSamples] + epsilons[,1])

estimatesNoisy <- stan(file = 'PendulumInfer.stan',
data = list (T  = nSamples,
y0 = m0,
z  = zStanNoisy,
t0 = t0,
ts = ts
),
seed = 42,
chains = 1,
iter = 1000,
warmup = 500,
refresh = -1
)
eNoisy <- extract(estimatesNoisy,pars=c("theta[1]","sigma[1]","lp__"),permuted=TRUE)

This gives an estiamted value for $g$ of 8.5871024 which is ok but not great.

# Postamble

library(knitr)
knit('Pendulum.Rmd')

And this from command line:

pandoc -s Pendulum.md --filter=./Include > PendulumExpanded.html

# Introduction

Let $\{X_t\}_{t \geq 1}$ be a (hidden) Markov process. By hidden, we mean that we are not able to observe it.

$\displaystyle X_1 \sim \mu(\centerdot) \quad X_t \,|\, (X_{t-1} = x) \sim f(\centerdot \,|\, x)$

And let $\{Y_t\}_{t \geq 1}$ be an observable Markov process such that

$\displaystyle Y_t \,|\, (X_{t} = x) \sim g(\centerdot \,|\, x)$

That is the observations are conditionally independent given the state of the hidden process.

As an example let us take the one given in Särkkä (2013) where the movement of a car is given by Newton’s laws of motion and the acceleration is modelled as white noise.

\displaystyle \begin{aligned} X_t &= AX_{t-1} + Q \epsilon_t \\ Y_t &= HX_t + R \eta_t \end{aligned}

Although we do not do so here, $A, Q, H$ and $R$ can be derived from the dynamics. For the purpose of this blog post, we note that they are given by

$\displaystyle A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} , \quad Q = \begin{bmatrix} \frac{q_1^c \Delta t^3}{3} & 0 & \frac{q_1^c \Delta t^2}{2} & 0 \\ 0 & \frac{q_2^c \Delta t^3}{3} & 0 & \frac{q_2^c \Delta t^2}{2} \\ \frac{q_1^c \Delta t^2}{2} & 0 & {q_1^c \Delta t} & 0 \\ 0 & \frac{q_2^c \Delta t^2}{2} & 0 & {q_2^c \Delta t} \end{bmatrix}$

and

$\displaystyle H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} , \quad R = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$

We wish to determine the position and velocity of the car given noisy observations of the position. In general we need the distribution of the hidden path given the observable path. We use the notation $x_{m:n}$ to mean the path of $x$ starting a $m$ and finishing at $n$.

$\displaystyle p(x_{1:n} \,|\, y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})}$

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}

> {-# LANGUAGE FlexibleInstances            #-}
> {-# LANGUAGE MultiParamTypeClasses        #-}
> {-# LANGUAGE FlexibleContexts             #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving   #-}
> {-# LANGUAGE ScopedTypeVariables          #-}

> module ParticleSmoothing
>   ( simpleSamples
>   , carSamples
>   , testCar
>   , testSimple
>   ) where

> import Data.Random.Source.PureMT
> import Data.Random hiding ( StdNormal, Normal )
> import qualified Data.Random as R
> import Control.Monad.Writer hiding ( Any, All )
> import qualified Numeric.LinearAlgebra.HMatrix as H
> import Foreign.Storable ( Storable )
> import Data.Maybe ( fromJust )
> import Data.Bits ( shiftR )
> import qualified Data.Vector as V
> import qualified Data.Vector.Unboxed as U
> import System.Random.MWC

> import Data.Array.Repa ( Z(..), (:.)(..), Any(..), computeP,
>                          extent, DIM1, DIM2, slice, All(..)
>                        )
> import qualified Data.Array.Repa as Repa

> import qualified Control.Monad.Loops as ML

> import PrettyPrint ()
> import Text.PrettyPrint.HughesPJClass ( Pretty, pPrint )

> import Data.Vector.Unboxed.Deriving


# Some Theory

If we could sample $X_{1:n}^{(i)} \sim p(x_{1:n} \,|\, y_{1:n})$ then we could approximate the posterior as

$\displaystyle \hat{p}(x_{1:n} \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n}^{(i)}}(x_{1:n})$

If we wish to, we can create marginal estimates

$\displaystyle \hat{p}(x_k \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{k}^{(i)}}(x_{k})$

When $k = N$, this is the filtering estimate.

## Standard Bayesian Recursion

Prediction

\displaystyle \begin{aligned} p(x_n \,|\, y_{1:n-1}) &= \int p(x_{n-1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ &= \int p(x_{n} \,|\, x_{n-1}, y_{1:n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ &= \int f(x_{n} \,|\, x_{n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ \end{aligned}

Update

\displaystyle \begin{aligned} p(x_n \,|\, y_{1:n}) &= \frac{p(y_n \,|\, x_n, y_{1:n-1}) \, p(x_n \,|\, y_{1:n-1})} {p(y_n \,|\, y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})} {p(y_n \,|\, y_{1:n-1})} \end{aligned}

where by definition

$\displaystyle {p(y_n \,|\, y_{1:n-1})} = \int {g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})} \,\mathrm{d}x_n$

## Path Space Recursion

We have

\displaystyle \begin{aligned} p(x_{1:n} \,|\, y_{1:n}) &= \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})} \\ &= \frac{p(x_n, y_n \,|\, x_{1:n-1}, y_{1:n-1})}{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{p(y_n \,|\, x_{1:n}, y_{1:n-1}) \, p(x_n \,|\, x_{1:n-1}, y_{1:n-1}) }{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, \frac{p(x_{1:n-1}, y_{1:n-1})}{ \, p(y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, {p(x_{1:n-1} \,|\,y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \overbrace{f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}}^{\mathrm{predictive}\,p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})} \\ \end{aligned}

where by definition

$\displaystyle p(y_n \,|\, y_{1:n-1}) = \int g(y_n \,|\, x_n) \, p(x_{1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{1:n}$

Prediction

$\displaystyle p(x_{1:n} \,|\, y_{1:n-1}) = f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}$

Update

$\displaystyle p(x_{1:n} \,|\, y_{1:n}) = \frac{g(y_n \,|\, x_{n}) \, {p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})}$

## Algorithm

The idea is to simulate paths using the recursion we derived above.

At time $n-1$ we have an approximating distribution

$\displaystyle \hat{p}(x_{1:n-1} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n-1}}^{(i)}(x_{1:n-1})$

Sample $\tilde{X}_n^{(i)} \sim f(\centerdot \,|\, X_{n-1}^{(i)})$ and set $\tilde{X}_{1:n}^{(i)} = (\tilde{X}_{1:n-1}^{(i)}, \tilde{X}_n^{(i)})$. We then have an approximation of the prediction step

$\displaystyle \hat{p}(x_{1:n} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})$

Substituting

\displaystyle \begin{aligned} {\hat{p}(y_n \,|\, y_{1:n-1})} &= \int {g(y_n \,|\, x_n) \, \hat{p}(x_{1:n} \,|\, y_{1:n-1})} \,\mathrm{d}x_n \\ &= \int {g(y_n \,|\, x_n)}\frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n-1}}^{(i)}(x_{1:n}) \,\mathrm{d}x_n \\ &= \frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})} \end{aligned}

and again

\displaystyle \begin{aligned} \tilde{p}(x_{1:n} \,|\, y_{1:n}) &= \frac{g(y_n \,|\, x_{n}) \, {\hat{p}(x_{1:n} \,|\, y_{1:n-1})}} {\hat{p}(y_n \,|\, y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})} {\frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \frac{ \sum_{i=1}^N g(y_n \,|\, \tilde{X}_n^{(i)}) \, \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})} {\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \sum_{i=1}^N W_n^{(i)} \delta_{\tilde{X}_{1:n}^{(i)}} (x_{1:n}) \end{aligned}

where $W_n^{(i)} \propto g(y_n \,|\, \tilde{X}_n^{(i)})$ and $\sum_{i=1}^N W_n^{(i)} = 1$.

Now sample

$\displaystyle X_{1:n}^{(i)} \sim \tilde{p}(x_{1:n} \,|\, y_{1:n})$

Let’s specify some values for the example of the car moving in two dimensions.

> deltaT, sigma1, sigma2, qc1, qc2 :: Double
> deltaT = 0.1
> sigma1 = 1/2
> sigma2 = 1/2
> qc1 = 1
> qc2 = 1

> bigA :: H.Matrix Double
> bigA = (4 H.>< 4) bigAl

> bigAl :: [Double]
> bigAl = [1, 0 , deltaT,      0,
>          0, 1,       0, deltaT,
>          0, 0,       1,      0,
>          0, 0,       0,      1]

> bigQ :: H.Herm Double
> bigQ = H.trustSym $(4 H.>< 4) bigQl  > bigQl :: [Double] > bigQl = [qc1 * deltaT^3 / 3, 0, qc1 * deltaT^2 / 2, 0, > 0, qc2 * deltaT^3 / 3, 0, qc2 * deltaT^2 / 2, > qc1 * deltaT^2 / 2, 0, qc1 * deltaT, 0, > 0, qc2 * deltaT^2 / 2, 0, qc2 * deltaT]  > bigH :: H.Matrix Double > bigH = (2 H.>< 4) [1, 0, 0, 0, > 0, 1, 0, 0]  > bigR :: H.Herm Double > bigR = H.trustSym$ (2 H.>< 2) [sigma1^2,        0,
>                                        0, sigma2^2]

> m0 :: H.Vector Double
> m0 = H.fromList [0, 0, 1, -1]

> bigP0 :: H.Herm Double
> bigP0 = H.trustSym $H.ident 4  > n :: Int > n = 23  With these we generate hidden and observable sample path. > carSample :: MonadRandom m => > H.Vector Double -> > m (Maybe ((H.Vector Double, H.Vector Double), H.Vector Double)) > carSample xPrev = do > xNew <- sample$ rvar (Normal (bigA H.#> xPrev) bigQ)
>   yNew <- sample $rvar (Normal (bigH H.#> xNew) bigR) > return$ Just ((xNew, yNew), xNew)

> carSamples :: [(H.Vector Double, H.Vector Double)]
> carSamples = evalState (ML.unfoldrM carSample m0) (pureMT 17)


We can plot an example trajectory for the car and the noisy observations that are available to the smoother / filter.

Sadly there is no equivalent to numpy in Haskell. Random number packages generate vectors, for multi-rank arrays there is repa and for fast matrix manipulation there is hmtatrix. Thus for our single step path update function, we have to pass in functions to perform type conversion. Clearly with all the copying inherent in this approach, performance is not going to be great.

The type synonym ArraySmoothing is used to denote the cloud of particles at each time step.

> type ArraySmoothing = Repa.Array Repa.U DIM2

> singleStep :: forall a . U.Unbox a =>
>               (a -> H.Vector Double) ->
>               (H.Vector Double -> a) ->
>               H.Matrix Double ->
>               H.Herm Double ->
>               H.Matrix Double ->
>               H.Herm Double ->
>               ArraySmoothing a -> H.Vector Double ->
>               WriterT [ArraySmoothing a] (StateT PureMT IO) (ArraySmoothing a)
> singleStep f g bigA bigQ bigH bigR x y = do
>   tell[x]
>   let (Z :. ix :. jx) = extent x
>
>   xHatR <- lift $computeP$ Repa.slice x (Any :. jx - 1)
>   let xHatH = map f $Repa.toList (xHatR :: Repa.Array Repa.U DIM1 a) > xTildeNextH <- lift$ mapM (\x -> sample $rvar (Normal (bigA H.#> x) bigQ)) xHatH > > let xTildeNextR = Repa.fromListUnboxed (Z :. ix :. (1 :: Int))$
>                     map g xTildeNextH
>       xTilde = Repa.append x xTildeNextR
>
>       weights = map (normalPdf y bigR) $> map (bigH H.#>) xTildeNextH > vs = runST (create >>= (asGenST$ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList weights)
>       totWeight = sum weights
>       js = indices (V.map (/ totWeight) $V.tail cumSumWeights) vs > xNewV = V.map (\j -> Repa.transpose$
>                            Repa.reshape (Z :. (1 :: Int) :. jx + 1) $> slice xTilde (Any :. j :. All)) js > xNewR = Repa.transpose$ V.foldr Repa.append (xNewV V.! 0) (V.tail xNewV)
>   computeP xNewR


The state for the car is a 4-tuple.

> data SystemState a = SystemState { xPos  :: a
>                                  , yPos  :: a
>                                  , _xSpd :: a
>                                  , _ySpd :: a
>                                  }


We initialize the smoother from some prior distribution.

> initCar :: StateT PureMT IO (ArraySmoothing (SystemState Double))
> initCar = do
>   xTilde1 <- replicateM n $sample$ rvar (Normal m0 bigP0)
>   let weights = map (normalPdf (snd $head carSamples) bigR)$
>                 map (bigH H.#>) xTilde1
>       vs = runST (create >>= (asGenST $\gen -> uniformVector gen n)) > cumSumWeights = V.scanl (+) 0 (V.fromList weights) > js = indices (V.tail cumSumWeights) vs > xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int))$
>               map ((\[a,b,c,d] -> SystemState a b c d) . H.toList) $> V.toList$
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1


Now we can run the smoother.

> smootherCar :: StateT PureMT IO
>             (ArraySmoothing (SystemState Double)
>             , [ArraySmoothing (SystemState Double)])
> smootherCar = runWriterT $do > xHat1 <- lift initCar > foldM (singleStep f g bigA bigQ bigH bigR) xHat1 (take 100$ map snd $tail carSamples)  > f :: SystemState Double -> H.Vector Double > f (SystemState a b c d) = H.fromList [a, b, c, d]  > g :: H.Vector Double -> SystemState Double > g = (\[a,b,c,d] -> (SystemState a b c d)) . H.toList  And create inferred positions for the car which we then plot. > testCar :: IO ([Double], [Double]) > testCar = do > states <- snd <$> evalStateT smootherCar (pureMT 24)
>   let xs :: [Repa.Array Repa.D DIM2 Double]
>       xs = map (Repa.map xPos) states
>   sumXs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose xs)
>   let ixs = map extent sumXs
>       sumLastXs = map (* (recip $fromIntegral n))$
>                   zipWith (Repa.!) sumXs (map (\(Z :. x) -> Z :. (x - 1)) ixs)
>   let ys :: [Repa.Array Repa.D DIM2 Double]
>       ys = map (Repa.map yPos) states
>   sumYs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose ys)
>   let ixsY = map extent sumYs
>       sumLastYs = map (* (recip $fromIntegral n))$
>                   zipWith (Repa.!) sumYs (map (\(Z :. x) -> Z :. (x - 1)) ixsY)
>   return (sumLastXs, sumLastYs)


So it seems our smoother does quite well at inferring the state at the latest observation, that is, when it is working as a filter. But what about estimates for earlier times? We should do better as we have observations in the past and in the future. Let’s try with a simpler example and a smaller number of particles.

First we create some samples for our simple 1 dimensional linear Gaussian model.

> bigA1, bigQ1, bigR1, bigH1 :: Double
> bigA1 = 0.5
> bigQ1 = 0.1
> bigR1 = 0.1
> bigH1 = 1.0

> simpleSample :: MonadRandom m =>
>               Double ->
>               m (Maybe ((Double, Double), Double))
> simpleSample xPrev = do
>   xNew <- sample $rvar (R.Normal (bigA1 * xPrev) bigQ1) > yNew <- sample$ rvar (R.Normal (bigH1 * xNew) bigR1)
>   return $Just ((xNew, yNew), xNew)  > simpleSamples :: [(Double, Double)] > simpleSamples = evalState (ML.unfoldrM simpleSample 0.0) (pureMT 17)  Again create a prior. > initSimple :: MonadRandom m => m (ArraySmoothing Double) > initSimple = do > let y = snd$ head simpleSamples
>   xTilde1 <- replicateM n $sample$ rvar $R.Normal y bigR1 > let weights = map (pdf (R.Normal y bigR1))$
>                 map (bigH1 *) xTilde1
>       totWeight = sum weights
>       vs = runST (create >>= (asGenST $\gen -> uniformVector gen n)) > cumSumWeights = V.scanl (+) 0 (V.fromList$ map (/ totWeight) weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $> V.toList$
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1


Now we can run the smoother.

> smootherSimple :: StateT PureMT IO (ArraySmoothing Double, [ArraySmoothing Double])
> smootherSimple = runWriterT $do > xHat1 <- lift initSimple > foldM (singleStep f1 g1 ((1 H.>< 1) [bigA1]) (H.trustSym$ (1 H.>< 1) [bigQ1^2])
>                           ((1 H.>< 1) [bigH1]) (H.trustSym $(1 H.>< 1) [bigR1^2])) > xHat1 > (take 20$ map H.fromList $map return . map snd$ tail simpleSamples)

> f1 :: Double -> H.Vector Double
> f1 a = H.fromList [a]

> g1 :: H.Vector Double -> Double
> g1 = (\[a] -> a) . H.toList


And finally we can look at the paths not just the means of the marginal distributions at the latest observation time.

> testSimple :: IO [[Double]]
> testSimple = do
>   states <- snd <$> evalStateT smootherSimple (pureMT 24) > let path :: Int -> IO (Repa.Array Repa.U DIM1 Double) > path i = computeP$ Repa.slice (last states) (Any :. i :. All)
>   paths <- mapM path [0..n - 1]
>   return $map Repa.toList paths  We can see that at some point in the past all the current particles have one ancestor. The marginals of the smoothing distribution (at some point in the past) have collapsed on to one particle. # Notes ## Helpers for the Inverse CDF That these are helpers for the inverse CDF is delayed to another blog post. > indices :: V.Vector Double -> V.Vector Double -> V.Vector Int > indices bs xs = V.map (binarySearch bs) xs  > binarySearch :: Ord a => > V.Vector a -> a -> Int > binarySearch vec x = loop 0 (V.length vec - 1) > where > loop !l !u > | u <= l = l > | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u > where k = l + (u - l) shiftR 1  ## Multivariate Normal The random-fu package does not contain a sampler or pdf for a multivariate normal so we create our own. This should be added to random-fu-multivariate package or something similar. > normalMultivariate :: H.Vector Double -> H.Herm Double -> RVarT m (H.Vector Double) > normalMultivariate mu bigSigma = do > z <- replicateM (H.size mu) (rvarT R.StdNormal) > return$ mu + bigA H.#> (H.fromList z)
>   where
>     (vals, bigU) = H.eigSH bigSigma
>     lSqrt = H.diag $H.cmap sqrt vals > bigA = bigU H.<> lSqrt  > data family Normal k :: *  > data instance Normal (H.Vector Double) = Normal (H.Vector Double) (H.Herm Double)  > instance Distribution Normal (H.Vector Double) where > rvar (Normal m s) = normalMultivariate m s  > normalPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) => > H.Vector a -> H.Herm a -> H.Vector a -> a > normalPdf mu sigma x = exp$ normalLogPdf mu sigma x

> normalLogPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
>                  H.Vector a -> H.Herm a -> H.Vector a -> a
> normalLogPdf mu bigSigma x = - H.sumElements (H.cmap log (diagonals dec))
>                               - 0.5 * (fromIntegral (H.size mu)) * log (2 * pi)
>                               - 0.5 * s
>   where
>     dec = fromJust $H.mbChol bigSigma > t = fromJust$ H.linearSolve (H.tr dec) (H.asColumn $x - mu) > u = H.cmap (\x -> x * x) t > s = H.sumElements u  > diagonals :: (Storable a, H.Element t, H.Indexable (H.Vector t) a) => > H.Matrix t -> H.Vector a > diagonals m = H.fromList (map (\i -> m H.! i H.! i) [0..n-1]) > where > n = max (H.rows m) (H.cols m)  > instance PDF Normal (H.Vector Double) where > pdf (Normal m s) = normalPdf m s > logPdf (Normal m s) = normalLogPdf m s  ## Misellaneous > derivingUnbox "SystemState" > [t| forall a . (U.Unbox a) => SystemState a -> (a, a, a, a) |] > [| \ (SystemState x y xdot ydot) -> (x, y, xdot, ydot) |] > [| \ (x, y, xdot, ydot) -> SystemState x y xdot ydot |]  > instance Pretty a => Pretty (SystemState a) where > pPrint (SystemState x y xdot ydot ) = pPrint (x, y, xdot, ydot)  # Bibliography Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press. # Floating Point: A Faustian Bargain? Every so often, someone bitten by floating point arithmetic behaving in an unexpected way is tempted to suggest that a calculation should be done be precisely and rounding done at the end. With floating point rounding is done at every step. Here’s an example of why floating point might really be the best option for numerical calculations. Suppose you wish to find the roots of a quintic equation. > import Numeric.AD > import Data.List > import Data.Ratio  > p :: Num a => a -> a > p x = x^5 - 2*x^4 - 3*x^3 + 3*x^2 - 2*x - 1  We can do so using Newton-Raphson using automatic differentiation to calculate the derivative (even though for polynomials this is trivial). > nr :: Fractional a => [a] > nr = unfoldr g 0 > where > g z = let u = z - (p z) / (h z) in Just (u, u) > h z = let [y] = grad (\[x] -> p x) [z] in y  After 7 iterations we see the size of the denominator is quite large (33308 digits) and the calculation takes many seconds. ghci> length$ show $denominator (nr!!7) 33308  On the other hand if we use floating point we get an answer accurate to 1 in $2^{53}$ after 7 iterations very quickly. ghci> mapM_ putStrLn$ map show $take 7 nr -0.5 -0.3368421052631579 -0.31572844839628944 -0.31530116270327685 -0.31530098645936266 -0.3153009864593327 -0.3153009864593327  The example is taken from here who refers the reader to Nick Higham’s book: Accuracy and Stability of Numerical Algorithms. Of course we should check we found a right answer. ghci> p$ nr!!6
0.0


# Introduction

We previously used importance sampling in the case where we did not have a sampler available for the distribution from which we wished to sample. There is an even more compelling case for using importance sampling.

Suppose we wish to estimate the probability of a rare event. For example, suppose we wish to estimate $\mathbb{P}(X > 5)$ where $X \sim {\mathcal{N}}(0,1)$. In this case, we can look up the answer $\mathbb{P}(X > 5) \approx 2.86710^{-7}$. But suppose we couldn’t look up the answer. One strategy that might occur to us is to sample and then estimate the probability by counting the number of times out of the total that the sample was bigger than 5. The flaw in this is obvious but let’s try it anyway.

> module Girsanov where

> import qualified Data.Vector as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.Number.Erf
> import Data.List ( transpose )

> samples :: (Foldable f, MonadRandom m) =>
>                     (Int -> RVar Double -> RVar (f Double)) ->
>                     Int ->
>                     m (f Double)
> samples repM n = sample $repM n$ stdNormal

> biggerThan5 :: Int
> biggerThan5 = length (evalState xs (pureMT 42))
>   where
>     xs :: MonadRandom m => m [Double]
>     xs = liftM (filter (>= 5.0)) samples replicateM 100000  As we might have expected, even if we draw 100,000 samples, we estimate this probability quite poorly. ghci> biggerThan5 0  Using importance sampling we can do a lot better. Let $\xi$ be a normally distributed random variable with zero mean and unit variance under the Lebesgue measure $\mathbb{P}$. As usual we can then define a new probability measure, the law of $\xi$, by \displaystyle \begin{aligned} \mathbb{P}_\xi((-\infty, b]) &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^b e^{-x^2/2}\,\mathrm{d}x \end{aligned} Thus \displaystyle \begin{aligned} \mathbb{E}_\xi(f) &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty f(x) e^{-x^2/2}\,\mathrm{d}x \\ &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty f(x) e^{a^2/2}e^{-a x}e^{-(x-a)^2/2}\,\mathrm{d}x \\ &= \mathbb{E}_{\xi + a}(fg) \\ &= \mathbb{\tilde{E}}_{\xi + a}(f) \end{aligned} where we have defined $\displaystyle g(x) \triangleq e^{a^2/2}e^{-a x} \quad \mathrm{and} \quad \mathbb{\tilde{P}}((-\infty, b]) \triangleq \int_{-\infty}^b g(x)\,\mathrm{d}x$ Thus we can estimate $\mathbb{P}(X > 5)$ either by sampling from a normal distribution with mean 0 and counting the number of samples that are above 5 or we can sample from a normal distribution with mean 5 and calculating the appropriately weighted mean $\displaystyle \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{\{x > 5\}}g(y)$ Let’s try this out. > biggerThan5' :: Double > biggerThan5' = sum (evalState xs (pureMT 42)) / (fromIntegral n) > where > xs :: MonadRandom m => m [Double] > xs = liftM (map g)
>          liftM (filter (>= 5.0)) $> liftM (map (+5))$
>          samples replicateM n
>     g x = exp $(5^2 / 2) - 5 * x > n = 100000  And now we get quite a good estimate. ghci> biggerThan5' 2.85776225450217e-7  ## Random Paths The probability of another rare event we might wish to estimate is that of Brownian Motion crossing a boundary. For example, what is the probability of Browian Motion crossing the line $y = 3.5$? Let’s try sampling 100 paths (we restrict the number so the chart is still readable). > epsilons :: (Foldable f, MonadRandom m) => > (Int -> RVar Double -> RVar (f Double)) -> > Double -> > Int -> > m (f Double) > epsilons repM deltaT n = sample$ repM n $rvar (Normal 0.0 (sqrt deltaT))  > bM0to1 :: Foldable f => > ((Double -> Double -> Double) -> Double -> f Double -> f Double) > -> (Int -> RVar Double -> RVar (f Double)) > -> Int > -> Int > -> f Double > bM0to1 scan repM seed n = > scan (+) 0.0$
>   evalState (epsilons repM (recip $fromIntegral n) n) (pureMT (fromIntegral seed))  We can see by eye in the chart below that again we do quite poorly. We know that $\mathbb{P}(T_a \leq t) = 2(1 - \Phi (a / \sqrt{t}))$ where $T_a = \inf \{t : W_t = a\}$. > p :: Double -> Double -> Double > p a t = 2 * (1 - normcdf (a / sqrt t))  ghci> p 1.0 1.0 0.31731050786291415 ghci> p 2.0 1.0 4.550026389635842e-2 ghci> p 3.0 1.0 2.699796063260207e-3  But what if we didn’t know this formula? Define $\displaystyle N(\omega) \triangleq \begin{cases} 1 & \text{if } \sup_{0 \leq t \leq 1}\tilde W_t \geq a \\ 0 & \text{if } \sup_{0 \leq t \leq 1}\tilde W_t < a \\ \end{cases}$ where $\mathbb{Q}$ is the measure which makes $\tilde W_t$ Brownian Motion. We can estimate the expectation of $N$ $\displaystyle \hat p_{\mathbb{Q}} = \frac{1}{M}\sum_{i=1}^H n_i$ where $n_i$ is 1 if Brownian Motion hits the barrier and 0 otherwise and M is the total number of simulations. We know from visual inspection that this gives poor results but let us try some calculations anyway. > n = 500 > m = 10000  > supAbove :: Double -> Double > supAbove a = fromIntegral count / fromIntegral n > where > count = length$
>             filter (>= a) $> map (\seed -> maximum$ bM0to1 scanl replicateM seed m) [0..n - 1]

> bM0to1WithDrift seed mu n =
>   zipWith (\m x -> x + mu * m * deltaT) [0..] $> bM0to1 scanl replicateM seed n > where > deltaT = recip$ fromIntegral n

ghci> supAbove 1.0
0.326

ghci> supAbove 2.0
7.0e-2

ghci> supAbove 3.0
0.0


As expected for a rare event we get an estimate of 0.

Fortunately we can use importance sampling for paths. If we take $\mu(\omega, t) = a$ where $a$ is a constant in Girsanov’s Theorem below then we know that $\tilde W_t = W_t + \int_0^t a \,\mathrm{d}s = W_t + at$ is $\mathbb{Q}$-Brownian Motion.

We observe that

\displaystyle \begin{aligned} \mathbb{Q}N &= \mathbb{P}\bigg(N\frac{\mathrm{d} \mathbb{Q}}{\mathrm{d} \mathbb{P}}\bigg) \\ &= \mathbb{P}\Bigg[N \exp \Bigg(-\int_0^1 \mu(\omega,t) \,\mathrm{d}W_t - \frac{1}{2} \int_0^1 \mu^2(\omega, t) \,\mathrm{d} t\Bigg) \Bigg] \\ &= \mathbb{P}\Bigg[N \exp \Bigg(-aW_1 - \frac{1}{2} a^2\Bigg) \Bigg] \end{aligned}

So we can also estimate the expectation of $N$ under $\mathbb{P}$ as

$\displaystyle \hat p_{\mathbb{P}} = \frac{1}{M}\sum_{i=1}^H n_i\exp{\bigg(-aw^{(1)}_i - \frac{a^2}{2}\bigg)}$

where $n_i$ is now 1 if Brownian Motion with the specified drift hits the barrier and 0 otherwise, and $w^{(1)}_i$ is Brownian Motion sampled at $t=1$.

We can see from the chart below that this is going to be better at hitting the required barrier.

Let’s do some calculations.

> supAbove' a = (sum $zipWith (*) ns ws) / fromIntegral n > where > deltaT = recip$ fromIntegral m
>
>     uss = map (\seed -> bM0to1 scanl replicateM seed m) [0..n - 1]
>     ys = map last uss
>     ws = map (\x -> exp (-a * x - 0.5 * a^2)) ys
>
>     vss = map (zipWith (\m x -> x + a * m * deltaT) [0..]) uss
>     sups = map maximum vss
>     ns = map fromIntegral $map fromEnum$ map (>=a) sups

ghci> supAbove' 1.0
0.31592655955519156

ghci> supAbove' 2.0
4.999395029856741e-2

ghci> supAbove' 3.0
2.3859203473651654e-3


The reader is invited to try the above estimates with 1,000 samples per path to see that even with this respectable number, the calculation goes awry.

## In General

If we have a probability space $(\Omega, {\mathcal{F}}, \mathbb{P})$ and a non-negative random variable $Z$ with $\mathbb{E}Z = 1$ then we can define a new probability measure $\mathbb{Q}$ on the same $\sigma$-algebra by

$\displaystyle \mathbb{Q} A \triangleq \int_A Z \,\mathrm{d} \mathbb{P}$

For any two probability measures when such a $Z$ exists, it is called the Radon-Nikodym derivative of $\mathbb{Q}$ with respect to $\mathbb{P}$ and denoted $\frac{\mathrm{d} \mathbb{Q}}{\mathrm{d} \mathbb{P}}$

Given that we managed to shift a Normal Distribution with non-zero mean in one measure to a Normal Distribution with another mean in another measure by producing the Radon-Nikodym derivative, might it be possible to shift, Brownian Motion with a drift under a one probability measure to be pure Brownian Motion under another probability measure by producing the Radon-Nikodym derivative? The answer is yes as Girsanov’s theorem below shows.

# Girsanov’s Theorem

Let $W_t$ be Brownian Motion on a probability space $(\Omega, {\mathcal{F}}, \mathbb{P})$ and let $\{{\mathcal{F}}_t\}_{t \in [0,T]}$ be a filtration for this Brownian Motion and let $\mu(\omega, t)$ be an adapted process such that the Novikov Sufficiency Condition holds

$\displaystyle \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^T \mu^2(s, \omega) \,\mathrm{d}s\bigg)}\bigg] = K < \infty$

then there exists a probability measure $\mathbb{Q}$ such that

• $\mathbb{Q}$ is equivalent to $\mathbb{P}$, that is, $\mathbb{Q}(A) = 0 \iff \mathbb{P}(A) = 0$.

• $\displaystyle {\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\mathbb{P}} = \exp \Bigg(-\int_0^T \mu(\omega,t) \,\mathrm{d}W_t - \frac{1}{2} \int_0^T \mu^2(\omega, t) \,\mathrm{d} t\Bigg)}$.

• $\tilde W_t = W_t + \int_0^t \mu(\omega, t) \,\mathrm{d}s$ is Brownian Motion on the probabiity space $(\Omega, {\mathcal{F}}, \mathbb{Q})$ also with the filtration $\{\mathcal{F}_t\}_{t \in [0,T]}$.

In order to prove Girsanov’s Theorem, we need a condition which allows to infer that $M_t(\mu)$ is a strict martingale. One such useful condition to which we have already alluded is the Novikov Sufficiency Condition.

## Proof

Define $\mathbb{Q}$ by

$\displaystyle \mathbb{Q}(A) = \mathbb{P}(1_A M_T) \quad \mathrm{where} \quad M_t(\mu) = \exp{\bigg(\int_0^t - \mu(t, \omega) \,\mathrm{d}W_s - \frac{1}{2}\int_0^t \mu^2(t, \omega) \,\mathrm{d}s\bigg)}$

Then, temporarily overloading the notation and writing $\mathbb{P}$ for expectation under $\mathbb{P}$, and applying the Novikov Sufficiency Condition to $f(s) - \mu(\omega ,s)$, we have

\displaystyle \begin{aligned} \mathbb{Q}\bigg[\exp{\int_0^T f(s) \,\mathrm{d}X_s}\bigg] &= \mathbb{Q}\bigg[\exp{\int_0^T f(s) \,\mathrm{d}W_s + \int_0^T \mu(\omega, s) \,\mathrm{d}s}\bigg] \\ &= \mathbb{P}\bigg[\exp{\bigg( \int_0^T \big(f(s) - \mu(\omega, s)\big)\,\mathrm{d}W_s + \int_0^T f(s)\mu(\omega, s)\,\mathrm{d}s - \frac{1}{2}\int_0^T \mu^2(\omega ,s) \,\mathrm{d}s \bigg)}\bigg] \\ &= \mathbb{P}\bigg[\exp{\bigg( \int_0^T \big(f(s) - \mu(\omega, s)\big)\,\mathrm{d}W_s - \frac{1}{2}\int_0^T \big(f(s) - \mu(\omega ,s)\big)^2 \,\mathrm{d}s + \frac{1}{2}\int_0^T f^2(s) \,\mathrm{d}s \bigg)}\bigg] \\ &= \frac{1}{2}\int_0^T f^2(s) \,\mathrm{d}s \, \mathbb{P}\bigg[\exp{\bigg( \int_0^T \big(f(s) - \mu(\omega, s)\big)\,\mathrm{d}W_s - \frac{1}{2}\int_0^T \big(f(s) - \mu(\omega ,s)\big)^2 \,\mathrm{d}s \bigg)}\bigg] \\ &= \frac{1}{2}\int_0^T f^2(s) \,\mathrm{d}s \end{aligned}

From whence we see that

$\displaystyle \mathbb{Q}\big(e^{i \zeta (X_t - X_s)}\big) = e^{-\frac{1}{2} \zeta^2 (t - s)}$

And since this characterizes Brownian Motion, we are done.

$\blacksquare$

# The Novikov Sufficiency Condition

## The Novikov Sufficiency Condition Statement

Let $\mu \in {\cal{L}}^2_{\mathrm{LOC}}[0,T]$ and further let it satisfy the Novikov condition

$\displaystyle \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^T \mu^2(s, \omega) \,\mathrm{d}s\bigg)}\bigg] = K < \infty$

then the process defined by

$\displaystyle M_t(\mu) = \exp{\bigg(\int_0^t \mu(t, \omega) \,\mathrm{d}W_s - \frac{1}{2}\int_0^t \mu^2(t, \omega) \,\mathrm{d}s\bigg)}$

is a strict martingale.

Before we prove this, we need two lemmas.

## Lemma 1

Let $M_t$ for $t \in [0,t]$ be a non-negative local martingale then $M_t$ is a super-martingale and if further $\mathbb{E}M_T = \mathbb{E}M_0$ then $M_t$ is a strict martingale.

Proof

Let $\{\tau_n\}_{n \in \mathbb{N}}$ be a localizing sequence for $M_t$ then for $0 < s < t < T$ and using Fatou’s lemma and the fact that the stopped process is a strict martingale

$\displaystyle \mathbb{E}(M_t \,|\, {\mathcal{F}_s}) = \mathbb{E}(\liminf_{n \rightarrow \infty} M_{t \land \tau_m} \,|\, {\mathcal{F}_s}) \leq \liminf_{n \rightarrow \infty} \mathbb{E}(M_{t \land \tau_m} \,|\, {\mathcal{F}_s}) = \liminf_{n \rightarrow \infty} M_{s \land \tau_m} = M_s$

Thus $M_t$ is a super-martingale and therefore

$\displaystyle \mathbb{E}M_T \leq \mathbb{E}M_t \leq \mathbb{E}M_s \leq \mathbb{E}M_0$

By assumption we have $\mathbb{E}M_T \leq \mathbb{E}M_0$ thus $M_t$ is a strict martingale.

$\blacksquare$

## Lemma 2

Let $M_t$ be a non-negative local martingale. If $\{\tau_n\}_{n \in \mathbb{N}}$ is a localizing sequence such that $\sup_n \|M_{T \land \tau_n}\|_p < \infty$ for some $p > 1$ then $M_t$ is a strict martingale.

Proof

$\displaystyle \mathbb{E}(|M_T - M_{T \land \tau_n}|) \leq \mathbb{E}(|M_T - r \land M_T) + \mathbb{E}(|r \land M_T - r \land M_{T \land \tau_n}|) + \mathbb{E}(M_{T \land \tau_n} - r \land M_{T \land \tau_n})$

By the super-martingale property $\mathbb{E}(M_T) \leq \mathbb{E}(M_0) < \infty$ and thus by dominated convergence we have that

$\displaystyle \lim_{r \rightarrow \infty} \mathbb{E}(r \land M_T) = \mathbb{E}(M_T) \quad \mathrm{and} \quad \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty}\mathbb{E}(|r \land M_T - r \land M_{T \land \tau_n}|) = 0$

We also have that

\displaystyle \begin{aligned} \mathbb{E}(M_{T \land \tau_n} - r \land M_{T \land \tau_n}) &= \mathbb{E}((M_{T \land \tau_n} - r \land M_{T \land \tau_n}){I}_{(M_{T \land \tau_n} > r)}) + \mathbb{E}((M_{T \land \tau_n} - r \land M_{T \land \tau_n}){I}_{(M_{T \land \tau_n} \leq r)}) \\ &= \mathbb{E}((M_{T \land \tau_n} - r \land M_{T \land \tau_n}){I}_{(M_{T \land \tau_n} > r)}) \\ &= \mathbb{E}(M_{T \land \tau_n}{I}_{(M_{T \land \tau_n} > r)}) - r\mathbb{P}({M_{T \land \tau_n} > r}) \end{aligned}

By Chebyshev’s inequality (see note (2) below), we have

$\displaystyle r\mathbb{P}({M_{T \land \tau_n} > r}) \leq \frac{r\mathbb{E}|X|^p}{r^p} \leq \frac{\sup_n{\mathbb{E}(M_{T \land \tau_n})^p}}{r^{p-1}}$

Taking limits first over $n \rightarrow \infty$ and then over $r \rightarrow \infty$ we see that

$\displaystyle \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty} r\mathbb{P}({M_{T \land \tau_n} > r}) \rightarrow 0$

For $0 \leq r \leq x$ and $p > 1$ we have $x \leq r^{1-p}x^p$. Thus

$\displaystyle \mathbb{E}(M_{T \land \tau_n}{I}_{(M_{T \land \tau_n} > r)}) \leq r^{1-p}\mathbb{E}(M_{T \land \tau_n}^p{I}_{(M_{T \land \tau_n} > r)}) \leq r^{1-p}\sup_n(M_{T \land \tau_n}^p)$

Again taking limits over $n \rightarrow \infty$ and then over $r \rightarrow \infty$ we have

$\displaystyle \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty} \mathbb{E}(M_{T \land \tau_n}{I}_{(M_{T \land \tau_n} > r)}) \rightarrow 0$

These two conclusions imply

$\displaystyle \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty} \mathbb{E}(M_{T \land \tau_n} - r \land M_{T \land \tau_n}) \rightarrow 0$

We can therefore conclude (since $M_{T \land \tau_n}$ is a martingale)

$\displaystyle \mathbb{E}(M_T) = \lim_{n \rightarrow \infty}\mathbb{E}(M_{T \land \tau_n}) = \mathbb{E}(M_0)$

Thus by the preceeding lemma $M_t$ is a strict as well as a local martingale.

$\blacksquare$

## The Novikov Sufficiency Condition Proof

### Step 1

First we note that $M_t(\lambda\mu)$ is a local martingale for $0 < \lambda < 1$. Let us show that it is a strict martingale. We can do this if for any localizing sequence $\{\tau_n\}_{n \in \mathbb{N}}$ we can show

$\displaystyle \sup_n\mathbb{E}(M_{T \land \tau_n}(\lambda\mu))^p < \infty$

using the preceeding lemma where $p > 1$.

We note that

\displaystyle \begin{aligned} M_t(\lambda\mu) &= \exp{\bigg(\int^t_0 \lambda\mu(\omega, s)\,\mathrm{d}W_s - \frac{1}{2}\int^t_0 \lambda^2\mu^2(\omega, s)\,\mathrm{d}s\bigg)} \\ &= {(M_t(\mu))}^{\lambda^2}\exp{\bigg((\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)} \end{aligned}

Now apply Hölder’s inequality with conjugates $({p\lambda^2})^{-1}$ and $({1 - p\lambda^2})^{-1}$ where $p$ is chosen to ensure that the conjugates are both strictly greater than 1 (otherwise we cannot apply the inequality).

\displaystyle \begin{aligned} \mathbb{E}((M_t(\lambda\mu))^p) &= \mathbb{E}\bigg[{(M_t(\mu))}^{p\lambda^2}\exp{\bigg(p(\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg] \\ &\le \bigg|\bigg|{M_t(\mu)}^{p\lambda^2}\bigg|\bigg|_{p\lambda^2} \bigg|\bigg|\exp{\bigg(p(\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg|\bigg|_{1 - p\lambda^2} \\ &= \mathbb{E}{\bigg[M_t(\mu)}\bigg]^{p\lambda^2} \mathbb{E}\bigg[\exp{\bigg(p\frac{\lambda - \lambda^2}{1 - p\lambda^2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2} \end{aligned}

Now let us choose

$\displaystyle p\frac{\lambda - \lambda^2}{1 - p\lambda^2} = \frac{1}{2}$

then

\displaystyle \begin{aligned} 2p(\lambda - \lambda^2) &= 1 - p\lambda^2 \\ p & = \frac{1}{2(\lambda - \lambda^2) + \lambda^2} \\ p &= \frac{1}{(2 - \lambda)\lambda} \end{aligned}

In order to apply Hölder’s inequality we need to check that $(p\lambda^2)^{-1} > 1$ and that $(1 - p\lambda^2)^{-1} > 1$ but this amounts to checking that $p\lambda^2 > 0$ and that $1 > \lambda$. We also need to check that $p > 0$ but this amounts to checking that $(2 - \lambda)\lambda < 1$ for $0 < \lambda < 1$ and this is easily checked to be true.

Re-writing the above inequality with this value of $p$ we have

\displaystyle \begin{aligned} \mathbb{E}((M_t(\lambda\mu))^p) &\le \mathbb{E}{\bigg[M_t(\mu)}\bigg]^{p\lambda^2} \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2} \end{aligned}

By the first lemma, since $M_t(\mu)$ is a non-negative local martingale, it is also a supermartingale. Furthermore $\mathbb{E}(M_0(\mu)) = 1$. Thus

$\displaystyle \mathbb{E}{\bigg[M_t(\mu)}\bigg]^{p\lambda^2} \leq 1$

and therefore

\displaystyle \begin{aligned} \mathbb{E}((M_t(\lambda\mu))^p) &\le \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2} \end{aligned}

### Step 2

Recall we have

$\displaystyle {M_t} = \exp\bigg( \int_0^t \mu(\omega ,s)\,\mathrm{d}W_s - \frac{1}{2}\int_0^t \mu(\omega ,s)\,\mathrm{d}s \bigg)$

Taking logs gives

$\displaystyle \log{M_t} = \int_0^t \mu(\omega ,s)\,\mathrm{d}W_s - \frac{1}{2}\int_0^t \mu(\omega ,s)^2\,\mathrm{d}s$

or in diferential form

$\displaystyle \mathrm{d}(\log{M_t}) = \mu(\omega ,t)\,\mathrm{d}W_t - \frac{1}{2}\mu(\omega ,t)^2\,\mathrm{d}t$

We can also apply Itô’s rule to $\log{M_t}$

\displaystyle \begin{aligned} \mathrm{d}(\log{M_t}) &= \frac{1}{M_t}\,\mathrm{d}M_t - \frac{1}{2}\frac{1}{M_t^2}\,\mathrm{d}[M]_t \\ \end{aligned}

where $[\ldots]$ denotes the quadratic variation of a stochastic process.

Comparing terms gives the stochastic differential equation

$\displaystyle \mathrm{d}M_t = M_t\mu(\omega,t)\,\mathrm{d}W_t$

In integral form this can also be written as

$\displaystyle M_t = 1 + \int_0^t M_s\mu(\omega, s)\,\mathrm{d}W_s$

Thus $M_t$ is a local martingale (it is defined by a stochastic differential equation) and by the first lemma it is a supermaringale. Hence $\mathbb{E}M_t \leq 1$.

Next we note that

$\displaystyle \exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, t)\bigg)} = \exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, t) - \frac{1}{4}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s\bigg)} \exp{\bigg(\frac{1}{4}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s\bigg)}$

to which we can apply Hölder’s inequality with conjugates $p = q = 2$ to obtain

\displaystyle \begin{aligned} \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, t)\bigg)}\bigg] &= \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, t) - \frac{1}{4}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s \bigg)} \exp{\bigg(\frac{1}{4}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s \bigg)}\bigg] \\ & \leq \sqrt{\mathbb{E}\bigg[\exp{\bigg(\int_0^t \mu(\omega, t) - \frac{1}{2}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s \bigg)}\bigg]} \sqrt{\mathbb{E}\exp{\bigg(\frac{1}{2}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s \bigg)}\bigg]} \end{aligned}

Applying the supermartingale inequality then gives

\displaystyle \begin{aligned} \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, t)\bigg)}\bigg] & \leq \sqrt{\mathbb{E}\exp{\bigg(\frac{1}{2}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s \bigg)}\bigg]} \end{aligned}

### Step 3

Now we can apply the result in Step 2 to the result in Step 1.

\displaystyle \begin{aligned} \mathbb{E}((M_t(\lambda\mu))^p) &\le \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2} \\ &\le {\mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu^2(\omega, t) \,\mathrm{d}s \bigg)}\bigg]}^{(1 - p\lambda^2)/2} \\ &\le K^{(1 - p\lambda^2)/2} \end{aligned}

We can replace $M_t$ by $M_t {\mathcal{I}}_{t < \tau}$ for any stopping time $\tau$. Thus for a localizing sequence we have

\displaystyle \begin{aligned} \mathbb{E}((M_{t \land \tau_n}(\lambda\mu))^p) &\le K^{(1 - p\lambda^2)/2} \end{aligned}

From which we can conclude

$\displaystyle \sup_n \|M_{T \land \tau_n}(\lambda\mu)\|_p < \infty$

Now we can apply the second lemma to conclude that $M_{T \land \tau_n}(\lambda\mu)$ is a strict martingale.

### Final Step

\displaystyle \begin{aligned} M_t(\lambda\mu) &= \exp{\bigg(\int^t_0 \lambda\mu(\omega, s)\,\mathrm{d}W_s - \frac{1}{2}\int^t_0 \lambda^2\mu^2(\omega, s)\,\mathrm{d}s\bigg)} \\ &= {(M_t(\mu))}^{\lambda^2}\exp{\bigg((\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)} \end{aligned}

Now apply Hölder’s inequality with conjugates $p = \lambda^{-2}$ and $q = (1 - \lambda^2)^{-1}$.

$\displaystyle 1 = \mathbb{E}(M_t(\lambda\mu) \le \mathbb{E}(M_t(\mu))^{\lambda^2}\mathbb{E}{\bigg(}\exp{\bigg(\frac{\lambda}{1 + \lambda}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg)^{1 - \lambda^2}$

And then we can apply Jensen’s inequality to the last term on the right hand side with the convex function $x^{(1 + \lambda)/2\lambda}$.

$\displaystyle 1 \le \mathbb{E}(M_t(\mu))^{\lambda^2} \mathbb{E}{\bigg(}\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg)^{2\lambda(1- \lambda)}$

Using the inequality we established in Step 2 and the Novikov condition then gives

$\displaystyle 1 \le \mathbb{E}(M_t(\mu))^{\lambda^2} K^{\lambda(1 - \lambda)}$

If we now let $\lambda \nearrow 1$ we see that we must have $1 \le \mathbb{E}(M_t(\mu))$. We already now that $1 \ge \mathbb{E}(M_t(\mu))$ by the first lemma and so we have finally proved that $M_t(\mu)$ is a martingale.

# Notes

1. We have already used importance sampling and also touched on changes of measure.

2. Chebyshev’s inequality is usually stated for the second moment but the proof is easily adapted:

$\displaystyle \mathbb P( |X| > u ) = \int 1_{|X| > u} ~d\mathbb P = \frac 1 {u^p} \int u^p 1_{|X| > u} ~d\mathbb P < \frac 1 {u^p} \int |X|^p 1_{|X| > u} ~ d\mathbb P \le \frac 1 {u^p} \mathbb E|X|^p.$

1. We follow Handel (2007); a similar approach is given in Steele (2001).

# Bibliography

Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”

Steele, J.M. 2001. Stochastic Calculus and Financial Applications. Applications of Mathematics. Springer New York. https://books.google.co.uk/books?id=fsgkBAAAQBAJ.

# Introduction

Suppose we wish to model a process described by a differential equation and initial condition

\displaystyle \begin{aligned} \dot{x}(t) &= a(x, t) \\ x(0) &= a_0 \end{aligned}

But we wish to do this in the presence of noise. It’s not clear how do to this but maybe we can model the process discretely, add noise and somehow take limits.

Let $\pi = \{0 = t_0 \leq t_1 \leq \ldots \leq t_n = t\}$ be a partition of $[0, t]$ then we can discretise the above, allow the state to be random and add in some noise which we model as samples of Brownian motion at the selected times multiplied by $b$ so that we can vary the amount noise depending on the state. We change the notation from $x$ to $X(\omega)$ to indicate that the variable is now random over some probability space.

\displaystyle \begin{aligned} {X}(t_{i+1}, \omega) - {X}(t_i, \omega) &= a({X}(t_i, \omega))(t_{i+1} - t_i) + b({X}(t_i, \omega))(W(t_{i+1}, \omega) - W(t_i, \omega)) \\ X(t_0, \omega) &= A_{0}(\omega) \end{aligned}

We can suppress explicit mention of $\omega$ and use subscripts to avoid clutter.

\displaystyle \begin{aligned} {X}_{t_{i+1}} - {X}_{t_i} &= a({X}_{t_i})(t_{i+1} - t_i) + b({X}_{t_i})(W_{t_{i+1}} - W_{t_i}) \\ X(t_0) &= A_{0}(\omega) \end{aligned}

We can make this depend continuously on time specifying that

$\displaystyle X_t = X_{t_i} \quad \mathrm{for} \, t \in (t_i, t_{i+1}]$

and then telescoping to obtain

\displaystyle \begin{aligned} {X}_{t} &= X_{t_0} + \sum_{i=0}^{k-1} a({X}_{t_i})(t_{i+1} - t_i) + \sum_{i=0}^{k-1} b({X}_{t_i})(W_{t_{i+1}} - W_{t_i}) \quad \mathrm{for} \, t \in (t_k, t_{k+1}] \end{aligned}

In the limit, the second term on the right looks like an ordinary integral with respect to time albeit the integrand is stochastic but what are we to make of the the third term? We know that Brownian motion is nowhere differentiable so it would seem the task is impossible. However, let us see what progress we can make with so-called simple proceses.

# Simple Processes

Let

$\displaystyle X(t,\omega) = \sum_{i=0}^{k-1} B_i(\omega)\mathbb{I}_{(t_i, t_{i+1}]}(t)$

where $B_i$ is ${\cal{F}}(t_i)$-measurable. We call such a process simple. We can then define

$\displaystyle \int_0^\infty X_s \mathrm{d}W_s \triangleq \sum_{i=0}^{k-1} B_i{(W_{t_{i+1}} - W_{t_{i+1}})}$

So if we can produce a sequence of simple processes, $X_n$ that converge in some norm to $X$ then we can define

$\displaystyle \int_0^\infty X(s)\mathrm{d}W(s) \triangleq \lim_{n \to \infty}\int_0^\infty X_n(s)\mathrm{d}W(s)$

Of course we need to put some conditions of the particular class of stochastic processes for which this is possible and check that the limit exists and is unique.

We consider the ${\cal{L}}^2(\mu \times \mathbb{P})$, the space of square integrable functions with respect to the product measure $\mu \otimes \mathbb{P}$ where $\mu$ is Lesbegue measure on ${\mathbb{R}^+}$ and $\mathbb{P}$ is some given probability measure. We further restrict ourselves to progressively measurable functions. More explicitly, we consider the latter class of stochastic processes such that

$\displaystyle \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s < \infty$

# Less Simple Processes

## Bounded, Almost Surely Continuous and Progressively Adapted

Let $X$ be a bounded, almost surely continuous and progressively measurable process which is (almost surely) $0$ for $t > T$ for some positive constant $T$. Define

$\displaystyle X_n(t, \omega) \triangleq X\bigg(T\frac{i}{n}, \omega\bigg) \quad \mathrm{for} \quad T\frac{i}{n} \leq t < T\frac{i + 1}{n}$

These processes are cleary progressively measurable and by bounded convergence ($X$ is bounded by hypothesis and $\{X_n\}_{n=0,\ldots}$ is uniformly bounded by the same bound).

$\displaystyle \lim_{n \to \infty}\|X - X_n\|_2 = 0$

## Bounded and Progressively Measurable

Let $X$ be a bounded and progressively measurable process which is (almost surely) $0$ for $t > T$ for some positive constant $T$. Define

$\displaystyle X_n(t, \omega) \triangleq \frac{1}{1/n}\int_{t-1/n}^t X(s, \omega) \,\mathrm{d}s$

Then $X^n(s, \omega)$ is bounded, continuous and progressively measurable and it is well known that $X^n(t, \omega) \rightarrow X(t, \omega)$ as $n \rightarrow 0$. Again by bounded convergence

$\displaystyle \lim_{n \to \infty}\|X - X_n\|_2 = 0$

## Progressively Measurable

Firstly, let $X$ be a progressively measurable process which is (almost surely) $0$ for $t > T$ for some positive constant $T$. Define $X_n(t, \omega) = X(t, \omega) \land n$. Then $X_n$ is bounded and by dominated convergence

$\displaystyle \lim_{n \to \infty}\|X - X_n\|_2 = 0$

Finally let $X$ be a progressively measurable process. Define

$\displaystyle X_n(t, \omega) \triangleq \begin{cases} X(t, \omega) & \text{if } t \leq n \\ 0 & \text{if } \mathrm{otherwise} \end{cases}$

Clearly

$\displaystyle \lim_{n \to \infty}\|X - X_n\|_2 = 0$

# The Itô Isometry

Let $X$ be a simple process such that

$\displaystyle \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s < \infty$

then

$\displaystyle \mathbb{E}\bigg(\int_0^\infty X_s\,\mathrm{d}W_s\bigg)^2 = \mathbb{E}\bigg(\sum_{i=0}^{k-1} B_i{(W_{t_{i+1}} - W_{t_{i}})}\bigg)^2 = \sum_{i=0}^{k-1} \mathbb{E}(B_i)^2({t_{i+1}} - {t_{i}}) = \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s$

Now suppose that $\{H_n\}_{n \in \mathbb{N}}$ is a Cauchy sequence of progressively measurable simple functions in ${\cal{L}}^2(\mu \times \mathbb{P})$ then since the difference of two simple processes is again a simple process we can apply the Itô Isometry to deduce that

$\displaystyle \lim_{m,n \to \infty}\mathbb{E}\bigg(\int_0^\infty (H_n(s) - H_m(s))\,\mathrm{d}W(s)\bigg)^2 = \lim_{m,n \to \infty}\mathbb{E}\int_0^\infty (H_n(s) - H_m(s))^2\,\mathrm{d}s = 0$

In other words, $\int_0^\infty H_n(s)\,\mathrm{d}W(s)$ is also Cauchy in ${\cal{L}}^2(\mathbb{P})$ and since this is complete, we can conclude that

$\displaystyle \int_0^\infty X(s)\mathrm{d}W(s) \triangleq \lim_{n \to \infty}\int_0^\infty X_n(s)\mathrm{d}W(s)$

exists (in ${\cal{L}}^2(\mathbb{P})$). Uniqueness follows using the triangle inequality and the Itô isometry.

# Notes

1. We defer proving the definition also makes sense almost surely to another blog post.

2. This approach seems fairly standard see for example Handel (2007) and Mörters et al. (2010).

3. Rogers and Williams (2000) takes a more general approach.

4. Protter (2004) takes a different approach by defining stochastic processes which are good integrators, a more abstract motivation than the one we give here.

5. The requirement of progressive measurability can be relaxed.

# Bibliography

Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”

Mörters, P, Y Peres, O Schramm, and W Werner. 2010. Brownian motion. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=e-TbA-dSrzYC.

Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.

Rogers, L.C.G., and D. Williams. 2000. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus. Cambridge Mathematical Library. Cambridge University Press. https://books.google.co.uk/books?id=bDQy-zoHWfcC.

# Conditional Expectation under Change of Measure

## Theorem

Let $\mathbb{P}$ and $\mathbb{Q}$ be measures on $(\Omega, {\mathcal{F}})$ with $\mathbb{Q} \ll \mathbb{P}$, ${\mathcal{G}} \subset {\mathcal{F}}$ a sub $\sigma$-algebra and $X$ an integrable random variable ($\mathbb{P}\lvert{X}\rvert < \infty$) then

$\displaystyle \mathbb{P}(X\vert {\mathcal{G}}) = \frac {\mathbb{Q}\bigg(X\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)} {\mathbb{Q}\bigg(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)}$

## Proof

\displaystyle \begin{aligned} \mathbb{Q}\bigg(\mathbb{I}_A \mathbb{Q}\bigg(X \frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}}\bigg\vert {\mathcal{G}}\bigg)\bigg) &= \mathbb{Q}\bigg(\mathbb{I}_A X \frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}}\bigg) \\ &= \mathbb{P}\big(\mathbb{I}_A X \big) \\ &= \mathbb{P}\big(\mathbb{I}_A \mathbb{P}(X \vert {\mathcal{G}})\big) \\ &= \mathbb{Q}\bigg(\mathbb{I}_A \frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\mathbb{P}(X \vert {\mathcal{G}})\bigg) \\ &= \mathbb{Q}\bigg(\mathbb{I}_A \mathbb{Q}\bigg(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)\mathbb{P}(X \vert {\mathcal{G}})\bigg) \\ \end{aligned}

Thus

$\displaystyle \mathbb{Q}\bigg(\mathbb{I}_A\mathbb{P}(X\vert {\mathcal{G}}){\mathbb{Q}\bigg(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)}\bigg) = \mathbb{Q}\bigg(\mathbb{I}_A \mathbb{Q}\bigg(X\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)\bigg)\quad \mathrm{for\,all}\, A \in {\mathcal{G}}$

Hence

$\displaystyle \mathbb{P}(X\vert {\mathcal{G}}){\mathbb{Q}\bigg(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)} = \mathbb{Q}\bigg(X\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)\quad {\mathbb{Q}-\mathrm{a.s.}}$

We note that

$\displaystyle A = \bigg\{\omega \,\bigg\vert\, \mathbb{Q}\bigg(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg)\bigg\}$

is ${\mathcal{G}}$-measurable (it is the result of a projection) and that

$\displaystyle 0 = \mathbb{Q}\bigg(\mathbb{I}_A\mathbb{Q}\bigg( \frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}} \bigg\vert {\mathcal{G}}\bigg)\bigg) = \mathbb{Q}\bigg(\mathbb{I}_A \frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}} \bigg) = \mathbb{P}(\mathbb{I}_A)$

Hence

$\displaystyle \mathbb{P}(X\vert {\mathcal{G}}) = \frac {\mathbb{Q}\bigg(X\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)} {\mathbb{Q}\bigg(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg\vert {\mathcal{G}}\bigg)}\quad {\mathbb{P}-\mathrm{a.s.}}$

as required.

# Introduction

If you look at the wikipedia article on Hidden Markov Models (HMMs) then you might be forgiven for concluding that these deal only with discrete time and finite state spaces. In fact, HMMs are much more general. Furthermore, a better understanding of such models can be helped by putting them into context. Before actually specifying what an HMM is, let us review something of Markov processes. A subsequent blog post will cover HMMs themselves.

# Markov Process and Chains

Recall that a transition kernel is a mapping $\mu : X \times {\cal{Y}} \rightarrow \overline{\mathbb{R}}_{+}$ where $(X, {\cal{X}})$ and $(Y, {\cal{Y}})$ are two measurable spaces such that $\mu(s, \cdot)$ is a probability measure on ${\cal{Y}}$ for all $s \in X$ and such that $\mu(\cdot, A)$ is a measurable function on $X$ for all $A \in {\cal{Y}}$.

For example, we could have $X = Y = \{a, b\}$ and ${\cal{X}} = {\cal{Y}} = \{\emptyset, \{a\}, \{b\}, \{a,b\}\}$ and $\mu(a,\{a\}) = 0.4, \mu(a,\{b\}) = 0.6, \mu(b,\{a\}) = 0.6, \mu(b,\{b\}) = 0.4$. Hopefully this should remind you of the transition matrix of a Markov chain.

Recall further that a family of such transitions $\{\mu_t\}_{t \in T}$ where $T$ is some index set satisfying

$\displaystyle \mu_{t+s}(x, A) = \int_{Y} \mu_s(x, {\mathrm{d}y})\mu_t(y, A)$

gives rise to a Markov process (under some mild conditions — see Rogers and Williams (2000) and Kallenberg (2002) for much more detail), that is, a process in which what happens next only depends on where the process is now and not how it got there.

Let us carry on with our example and take $T = \mathbb{N}$. With a slight abuse of notation and since $Y$ is finite we can re-write the integral as a sum

$\displaystyle \mu_{n+m}(x, z) = \sum_{y \in Y} \mu_m(x, y)\mu_n(y, z)$

which we recognise as a restatement of how Markov transition matrices combine.

# Some Examples

## A Fully Deterministic System

A deterministic system can be formulated as a Markov process with a particularly simple transition kernel given by

$\displaystyle \mu_t(x_s, A) = \delta(f_t(x_s), A) \triangleq \begin{cases} 1 & \text{if } f_t(x_s) \in A \\ 0 & \text{if } f_t(x_s) \notin A \end{cases}$

where $f_t$ is the deterministic state update function (the flow) and $\delta$ is the Dirac delta function.

## Parameters

Let us suppose that the determinstic system is dependent on some time-varying values for which we we are unable or unwish to specify a deterministic model. For example, we may be considering predator-prey model where the parameters cannot explain every aspect. We could augment the deterministic kernel in the previous example with

$\displaystyle \mu_t(\theta_s, {\mathrm{d}\phi}) = {{\cal{N}} \left( {\mathrm{d}\phi} \,\vert\, \theta_s, \tau^2(t-s) \right)}$

where we use Greek letters for the parameters (and Roman letters for state) and we use e.g. ${\mathrm{d}\phi}$ to indicate probability densities. In other words that the parameters tend to wiggle around like Brown’s pollen particles rather than remaining absolutely fixed.

## Ornstein-Uhlenbeck

Of course Brownian motion or diffusion may not be a good model for our parameters; with Brownian motion, the parameters could drift off to $\pm\infty$. We might believe that our parameters tend to stay close to some given value (mean-reverting) and use the Ornstein-Uhlenbeck kernel.

$\displaystyle \mu_t(\theta_s, {\mathrm{d}\phi}) = {{\cal{N}} \left( {\mathrm{d}\phi} \,\vert\, \alpha + (\theta_s - \alpha)e^{-\beta t},\frac{\sigma^2}{2\beta}\big(1 - e^{-2\beta t}\big) \right)}$

where $\beta$ expresses how strongly we expect the parameter to respond to perturbations, $\alpha$ is the mean to which the process wants to revert (aka the asymptotic mean) and $\sigma^2$ expresses how noisy the process is.

It is sometimes easier to view these transition kernels in terms of stochastic differential equations. Brownian motion can be expressed as

$\displaystyle \mathrm{d}X_t = \sigma\mathrm{d}W_t$

and Ornstein-Uhlenbeck can be expressed as

$\displaystyle \mathrm{d}X_t = -\beta(X_t - \alpha)\mathrm{d}t + \sigma\mathrm{d}W_t$

where $W_t$ is the Wiener process.

Let us check that the latter stochastic differential equation gives the stated kernel. Re-writing it in integral form and without loss of generality taking $s= 0$

$\displaystyle X_t = \alpha + (x_0 - \alpha)e^{-\beta t} + \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s$

Since the integral is of a deterministic function, the distribution of $X_t$ is normal. Thus we need only calculate the mean and variance.

The mean is straightforward.

$\displaystyle \mathbb{E}[X_t \,\vert\, X_0 = x_0] = \mathbb{E}\Bigg[\alpha + (x_0 - \alpha)e^{-\beta t} + \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s \,\vert\, X_0 = x_0\Bigg] = \alpha + (x_0 - \alpha)e^{-\beta t}$

Without loss of generality assume $t \leq u$ and writing $\mathbb{C}$ for covariance

\displaystyle \begin{aligned} \mathbb{C}[X_u, X_t \,\vert\, X_0 = x_0] &= \mathbb{E}\Bigg[\Bigg( \sigma\int_0^u e^{-\beta(u - s)}\mathrm{d}W_s \Bigg) \Bigg( \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s \Bigg)\Bigg] \\ &= \sigma^2e^{-\beta(u + t)} \mathbb{E}\Bigg[\Bigg(\int_0^u e^{\beta s}\mathrm{d}W_s\Bigg) \Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s\Bigg)\Bigg] \\ &= \sigma^2e^{-\beta(u + t)} \mathbb{E}\Bigg[\Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s + \int_t^u e^{\beta s}\mathrm{d}W_s\Bigg) \Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s\Bigg)\Bigg] \end{aligned}

And now we can use Ito and independence

\displaystyle \begin{aligned} \mathbb{C}[X_u, X_t \,\vert\, X_0 = x_0] &= \sigma^2e^{-\beta(u + t)}\mathbb{E}\Bigg[ \int_0^t e^{2\beta s}\mathrm{d}s \Bigg] \\ &= \frac{\sigma^2e^{-\beta(u + t)}}{2\beta}\big(e^{2\beta t} - 1\big) \end{aligned}

Substituting in $t = u$ gives the desired result.

# Bibliography

Kallenberg, O. 2002. Foundations of Modern Probability. Probability and Its Applications. Springer New York. http://books.google.co.uk/books?id=TBgFslMy8V4C.

Rogers, L. C. G., and David Williams. 2000. Diffusions, Markov Processes, and Martingales. Vol. 1. Cambridge Mathematical Library. Cambridge: Cambridge University Press.

# Population Growth Estimation via Hamiltonian Monte Carlo

Here’s the same analysis of estimating population growth using Stan.

data {
int<lower=0> N; // number of observations
vector[N] y;    // observed population
}

parameters {
real r;
}

model {
real k;
real p0;
real deltaT;
real sigma;
real mu0;
real sigma0;
vector[N] p;
k      <- 1.0;
p0     <- 0.1;
deltaT <- 0.0005;
sigma  <- 0.01;
mu0    <- 5;
sigma0 <- 10;

r ~ normal(mu0, sigma0);

for (n in 1:N) {
p[n] <- k * p0 * exp((n - 1) * r * deltaT) / (k + p0 * (exp((n - 1) * r * deltaT) - 1));
y[n] ~ normal(p[n], sigma);
}
}

Empirically, by looking at the posterior, this seems to do a better job than either extended Kalman or vanilla Metropolis.

# Introduction

Let us see if we can estimate the parameter for population growth using MCMC in the example in which we used Kalman filtering.

We recall the model.

\displaystyle \begin{aligned} \dot{p} & = rp\Big(1 - \frac{p}{k}\Big) \end{aligned}

$\displaystyle p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}$

And we are allowed to sample at regular intervals

\displaystyle \begin{aligned} p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\ y_i &= p_i + \epsilon_i \end{aligned}

In other words $y_i \sim {\cal{N}}(p_i, \sigma^2)$, where $\sigma$ is known so the likelihood is

$\displaystyle p(y\,|\,r) \propto \prod_{i=1}^n \exp{\bigg( -\frac{(y_i - p_i)^2}{2\sigma^2}\bigg)} = \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}$

Let us assume a prior of $r \sim {\cal{N}}(\mu_0,\sigma_0^2)$ then the posterior becomes

$\displaystyle p(r\,|\,y) \propto \exp{\bigg( -\frac{(r - \mu_0)^2}{2\sigma_0^2} \bigg)} \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}$

## Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}

> {-# LANGUAGE NoMonomorphismRestriction     #-}

> module PopGrowthMCMC where
>
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )


# Implementation

We assume most of the parameters are known with the exception of the the growth rate $r$. We fix this also in order to generate test data.

> k, p0 :: Double
> k = 1.0
> p0 = 0.1

> r, deltaT :: Double
> r = 10.0
> deltaT = 0.0005

> nObs :: Int
> nObs = 150


Here’s the implementation of the logistic function

> logit :: Double -> Double -> Double -> Double
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))


Let us create some noisy data.

> sigma :: Double
> sigma = 1e-2

> samples :: [Double]
> samples = zipWith (+) mus epsilons
>   where
>     mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
>     epsilons = evalState (sample $replicateM nObs$ rvar (Normal 0.0 sigma)) (pureMT 3)


Arbitrarily let us set the prior to have a rather vague normal distribution.

> mu0, sigma0 :: Double
> mu0 = 5.0
> sigma0 = 1e1

> prior :: Double -> Double
> prior r = exp (-(r - mu0)**2 / (2 * sigma0**2))

> likelihood :: Double -> [Double] -> Double
> likelihood r ys = exp (-sum (zipWith (\y mu -> (y - mu)**2 / (2 * sigma**2)) ys mus))
>   where
>     mus :: [Double]
>     mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])

> posterior :: Double -> [Double] -> Double
> posterior r ys = likelihood r ys * prior r


The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.

> acceptanceProb' :: Double -> Double -> [Double] -> Double
> acceptanceProb' r r' ys = min 1.0 ((posterior r' ys) / (posterior r ys))

> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (r, nAccs) (proposedJump, acceptOrReject) =
>   if acceptOrReject < acceptanceProb' r (r + proposedJump) samples
>   then (r + proposedJump, nAccs + 1)
>   else (r, nAccs)


Here are our proposals.

> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
>   evalState (replicateM nIters (sample (Normal 0.0 sigma)))
>   (pureMT $fromIntegral seed)  We also need samples from the uniform distribution > acceptOrRejects :: Int -> Int -> [Double] > acceptOrRejects seed nIters = > evalState (replicateM nIters (sample stdUniform)) > (pureMT$ fromIntegral seed)


Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning.

> nIters, burnIn :: Int
> nIters = 100000
> burnIn = nIters div 10


Let us start our chain to the mean of the prior. In theory this shoudn’t matter as by the time we have burnt in we should be sampling in the high density region of the distribution.

> startMu :: Double
> startMu = 5.0


This jump size should allow us to sample the region of high density at a reasonable granularity.

> jumpVar :: Double
> jumpVar = 0.01


Now we can test our MCMC implementation.

> test :: Int -> [(Double, Int)]
> test seed =
>   drop burnIn $> scanl oneStep (startMu, 0)$
>   zip (normalisedProposals seed jumpVar nIters)
>       (acceptOrRejects seed nIters)


We put the data into a histogram.

> numBins :: Int
> numBins = 400

> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = r - 0.5 * sigma0
>     upper = r + 0.5 * sigma0
>
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $test seed)  With 50 observations we don’t seem to be very certain about the growth rate. With 100 observations we do very much better. And with 150 observations we do even better. # The Flow of the Thames: An Autoregressive Model # Thames Flux It is roughly 150 miles from the source of the Thames to Kingston Bridge. If we assume that it flows at about 2 miles per hour then the water at Thames Head will have reached Kingston very roughly at $\frac{150}{24\times 2} \approxeq 3$ days. The Environmental Agency measure the flux at Kingston Bridge on a twice daily basis. Can we predict this? In the first instance without any other data and using our observation that Thames flushes itself every 3 days, let us try $\displaystyle X_t = \theta_1 X_{t-1} + \theta_2 X_{t-2} + \theta_3 X_{t-3} + \epsilon_t$ where $X_t$ is the flux on day $t$ and $\{\epsilon_t\}_{t \in \mathbb{N}}$ are independent normal errors with mean $0$ and variance some given value $\sigma^2$. # Kalman As it stands, our model is not Markov so we cannot directly apply techniques such as Kalman filtering or particle filtering to estimate the parameters. However we can re-write the model as $\displaystyle \begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \theta_3^{(t)} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^{(t-1)} \\ \theta_2^{(t-1)} \\ \theta_3^{(t-1)} \end{bmatrix} + \begin{bmatrix} \eta_{t} \\ \eta_{t} \\ \eta_{t} \end{bmatrix}$ $\displaystyle y_t = \begin{bmatrix} x_{t-1} & x_{t-2} & x_{t-3} \end{bmatrix} \begin{bmatrix} \theta_1^{(t)} \\ \theta_2^{(t)} \\ \theta_3^{(t)} \end{bmatrix} + \epsilon_{t}$ Note that the observation map now varies over time so we have modify our Kalman filter implementation to accept a different matrix at each step. > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-}  > {-# LANGUAGE DataKinds #-} > {-# LANGUAGE ScopedTypeVariables #-} > {-# LANGUAGE RankNTypes #-} > {-# LANGUAGE TypeOperators #-} > {-# LANGUAGE TypeFamilies #-}  > module Autoregression ( > predictions > ) where  > import GHC.TypeLits > import Numeric.LinearAlgebra.Static > import Data.Maybe ( fromJust )  > import qualified Data.Vector as V  > inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n > inv m = fromJust$ linSolve m eye

> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n -> [L m n] -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer muPrior sigmaPrior bigHs bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) (zip ys bigHs)
>
>     update :: (R n, Sq n) -> (R m, L m n) -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) (y, bigH) =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         v :: R m
>         v = y - bigH #> xHatFlat
>         bigS :: Sq m
>         bigS = bigH  bigSigmaHatFlat  (tr bigH) + bigSigmaY
>         bigK :: L n m
>         bigK = bigSigmaHatFlat  (tr bigH)  (inv bigS)
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK  bigS  (tr bigK)
>         xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA  bigSigmaHat  (tr bigA) + bigSigmaX


We can now set up the parameters to run the filter.

> stateVariance :: Double
> stateVariance = 1e-8

> bigSigmaX :: Sq 3
> bigSigmaX = fromList [ stateVariance, 0.0,           0.0
>                      , 0.0,           stateVariance, 0.0
>                      , 0.0,           0.0,           stateVariance
>                      ]

> bigA :: Sq 3
> bigA = eye

> muPrior :: R 3
> muPrior = fromList [0.0, 0.0, 0.0]

> sigmaPrior :: Sq 3
> sigmaPrior = fromList [ 1e1, 0.0, 0.0
>                       , 0.0, 1e1, 0.0
>                       , 0.0, 0.0, 1e1
>                       ]

> bigHsBuilder :: V.Vector Double -> [L 1 3]
> bigHsBuilder flows =
>   V.toList $> V.zipWith3 (\x0 x1 x2 -> fromList [x0, x1, x2]) > (V.tail flows) (V.tail$ V.tail flows) (V.tail $V.tail$ V.tail flows)

> obsVariance :: Double
> obsVariance = 1.0e-2

> bigSigmaY :: Sq 1
> bigSigmaY = fromList [ obsVariance ]

> predict :: R 3 -> Double -> Double -> Double -> Double
> predict theta f1 f2 f3 = h1 * f1 + h2 * f2 + h3 * f3
>   where
>     (h1, t1) = headTail theta
>     (h2, t2) = headTail t1
>     (h3, _)  = headTail t2

> thetas :: V.Vector Double -> [(R 3, Sq 3)]
> thetas flows = outer muPrior sigmaPrior (bigHsBuilder flows)
>                bigSigmaY bigA bigSigmaX (map (fromList . return) (V.toList flows))

> predictions :: V.Vector Double -> V.Vector Double
> predictions flows =
>   V.zipWith4 predict
>   (V.fromList $map fst (thetas flows)) > flows (V.tail flows) (V.tail$ V.tail flows)


# How Good is Our Model?

If we assume that parameters are essentially fixed by taking the state variance to be e.g. $10^{-8}$ then the fit is not good.

However, if we assume the parameters to undergo Brownian motion by taking the state variance to be e.g. $10^{-2}$ then we get a much better fit. Of course, Brownian motion is probably not a good way of modelling the parameters; we hardly expect that these could wander off to infinity.