Ecology, Dynamical Systems and Inference via PMMH

Introduction

In the 1920s, Lotka (1909) and Volterra (1926) developed a model of a very simple predator-prey ecosystem.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1  - c_1 N_1 N_2 \label{eq2a} \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & c_2 N_1 N_2 - \rho_2 N2 \label{eq2b}  \end{aligned}

Although simple, it turns out that the Canadian lynx and showshoe hare are well represented by such a model. Furthermore, the Hudson Bay Company kept records of how many pelts of each species were trapped for almost a century, giving a good proxy of the population of each species.

We can capture the fact that we do not have a complete model by describing our state of ignorance about the parameters. In order to keep this as simple as possible let us assume that log parameters undergo Brownian motion. That is, we know the parameters will jiggle around and the further into the future we look the less certain we are about what values they will have taken. By making the log parameters undergo Brownian motion, we can also capture our modelling assumption that birth, death and predation rates are always positive. A similar approach is taken in Dureau, Kalogeropoulos, and Baguelin (2013) where the (log) parameters of an epidemiological model are taken to be Ornstein-Uhlenbeck processes (which is biologically more plausible although adds to the complexity of the model, something we wish to avoid in an example such as this).

Andrieu, Doucet, and Holenstein (2010) propose a method to estimate the parameters of such models (Particle Marginal Metropolis Hastings aka PMMH) and the domain specific probabilistic language LibBi (Murray (n.d.)) can be used to apply this (and other inference methods).

For the sake of simplicity, in this blog post, we only model one parameter as being unknown and undergoing Brownian motion. A future blog post will consider more sophisticated scenarios.

A Dynamical System Aside

The above dynamical system is structurally unstable (more on this in a future post), a possible indication that it should not be considered as a good model of predator–prey interaction. Let us modify this to include carrying capacities for the populations of both species.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2  \end{aligned}

Data Generation with LibBi

Let’s generate some data using LibBi.

// Generate data assuming a fixed growth rate for hares rather than
// e.g. a growth rate that undergoes Brownian motion.

model PP {
  const h         = 0.1;    // time step
  const delta_abs = 1.0e-3; // absolute error tolerance
  const delta_rel = 1.0e-6; // relative error tolerance

  const a  = 5.0e-1 // Hare growth rate
  const k1 = 2.0e2  // Hare carrying capacity
  const b  = 2.0e-2 // Hare death rate per lynx
  const d  = 4.0e-1 // Lynx death rate
  const k2 = 2.0e1  // Lynx carrying capacity
  const c  = 4.0e-3 // Lynx birth rate per hare

  state P, Z       // Hares and lynxes
  state ln_alpha   // Hare growth rate - we express it in log form for
                   // consistency with the inference model
  obs P_obs        // Observations of hares

  sub initial {
    P ~ log_normal(log(100.0), 0.2)
    Z ~ log_normal(log(50.0), 0.1)
  }

  sub transition(delta = h) {
    ode(h = h, atoler = delta_abs, rtoler = delta_rel, alg = 'RK4(3)') {
      dP/dt =  a * P * (1 - P / k1) - b * P * Z
      dZ/dt = -d * Z * (1 + Z / k2) + c * P * Z
    }
  }

  sub observation {
    P_obs ~ log_normal(log(P), 0.1)
  }
}

We can look at phase space starting with different populations and see they all converge to the same fixed point.

Data Generation with Haskell

Since at some point in the future, I plan to produce Haskell versions of the methods given in Andrieu, Doucet, and Holenstein (2010), let’s generate the data using Haskell.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> module LotkaVolterra (
>     solLv
>   , solPp
>   , h0
>   , l0
>   , baz
>   , logBM
>   , eulerEx
>   )where
> import Numeric.GSL.ODE
> import Numeric.LinearAlgebra
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import Control.Monad.State

Here’s the unstable model.

> lvOde :: Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          [Double] ->
>          [Double]
> lvOde rho1 c1 rho2 c2 _t [h, l] =
>   [
>     rho1 * h - c1 * h * l
>   , c2 * h * l - rho2 * l
>   ]
> lvOde _rho1 _c1 _rho2 _c2 _t vars =
>   error $ "lvOde called with: " ++ show (length vars) ++ " variable"
> rho1, c1, rho2, c2 :: Double
> rho1 = 0.5
> c1 = 0.02
> rho2 = 0.4
> c2 = 0.004
> deltaT :: Double
> deltaT = 0.1
> solLv :: Matrix Double
> solLv = odeSolve (lvOde rho1 c1 rho2 c2)
>                  [50.0, 50.0]
>                  (fromList [0.0, deltaT .. 50])

And here’s the stable model.

> ppOde :: Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          [Double] ->
>          [Double]
> ppOde a k1 b d k2 c _t [p, z] =
>   [
>     a * p * (1 - p / k1) - b * p * z
>   , -d * z * (1 + z / k2) + c * p * z
>   ]
> ppOde _a _k1 _b _d _k2 _c _t vars =
>   error $ "ppOde called with: " ++ show (length vars) ++ " variable"
> a, k1, b, d, k2, c :: Double
> a = 0.5
> k1 = 200.0
> b = 0.02
> d = 0.4
> k2 = 50.0
> c = 0.004
> solPp :: Double -> Double -> Matrix Double
> solPp x y = odeSolve (ppOde a k1 b d k2 c)
>                  [x, y]
>                  (fromList [0.0, deltaT .. 50])
> gamma, alpha, beta :: Double
> gamma = d / a
> alpha = a / (c * k1)
> beta  = d / (a * k2)
> fp :: (Double, Double)
> fp = ((gamma + beta) / (1 + alpha * beta), (1 - gamma * alpha) / (1 + alpha * beta))
> h0, l0 :: Double
> h0 = a * fst fp / c
> l0 = a * snd fp / b
> foo, bar :: Matrix R
> foo = matrix 2 [a / k1, b, c, negate d / k2]
> bar = matrix 1 [a, d]
> baz :: Maybe (Matrix R)
> baz = linearSolve foo bar

This gives a stable fixed point of

ghci> baz
  Just (2><1)
   [ 120.00000000000001
   ,               10.0 ]

Here’s an example of convergence to that fixed point in phase space.

The Stochastic Model

Let us now assume that the Hare growth parameter undergoes Brownian motion so that the further into the future we go, the less certain we are about it. In order to ensure that this parameter remains positive, let’s model the log of it to be Brownian motion.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2 \\  \mathrm{d} \rho_1 & = & \rho_1 \sigma_{\rho_1} \mathrm{d}W_t  \end{aligned}

where the final equation is a stochastic differential equation with W_t being a Wiener process.

By Itô we have

\displaystyle   \mathrm{d} (\log{\rho_1}) = - \frac{\sigma_{\rho_1}^2}{2} \mathrm{d} t + \sigma_{\rho_1} \mathrm{d}W_t

We can use this to generate paths for \rho_1.

\displaystyle   \rho_1(t + \Delta t) = \rho_1(t)\exp{\bigg(- \frac{\sigma_{\rho_1}^2}{2} \Delta t + \sigma_{\rho_1} \sqrt{\Delta t} Z\bigg)}

where Z \sim {\mathcal{N}}(0,1).

> oneStepLogBM :: MonadRandom m => Double -> Double -> Double -> m Double
> oneStepLogBM deltaT sigma rhoPrev = do
>   x <- sample $ rvar StdNormal
>   return $ rhoPrev * exp(sigma * (sqrt deltaT) * x - 0.5 * sigma * sigma * deltaT)
> iterateM :: Monad m => (a -> m a) -> m a -> Int -> m [a]
> iterateM f mx n = sequence . take n . iterate (>>= f) $ mx
> logBMM :: MonadRandom m => Double -> Double -> Int -> Int -> m [Double]
> logBMM initRho sigma n m =
>   iterateM (oneStepLogBM (recip $ fromIntegral n) sigma) (return initRho) (n * m)
> logBM :: Double -> Double -> Int -> Int -> Int -> [Double]
> logBM initRho sigma n m seed =
>   evalState (logBMM initRho sigma n m) (pureMT $ fromIntegral seed)

We can see the further we go into the future the less certain we are about the value of the parameter.

Using this we can simulate the whole dynamical system which is now a stochastic process.

> f1, f2 :: Double -> Double -> Double ->
>           Double -> Double ->
>           Double
> f1 a k1 b p z = a * p * (1 - p / k1) - b * p * z
> f2 d k2 c p z = -d * z * (1 + z / k2) + c * p * z
> oneStepEuler :: MonadRandom m =>
>                 Double ->
>                 Double ->
>                 Double -> Double ->
>                 Double -> Double -> Double ->
>                 (Double, Double, Double) ->
>                 m (Double, Double, Double)
> oneStepEuler deltaT sigma k1 b d k2 c (rho1Prev, pPrev, zPrev) = do
>     let pNew = pPrev + deltaT * f1 (exp rho1Prev) k1 b pPrev zPrev
>     let zNew = zPrev + deltaT * f2 d k2 c pPrev zPrev
>     rho1New <- oneStepLogBM deltaT sigma rho1Prev
>     return (rho1New, pNew, zNew)
> euler :: MonadRandom m =>
>          (Double, Double, Double) ->
>          Double ->
>          Double -> Double ->
>          Double -> Double -> Double ->
>          Int -> Int ->
>          m [(Double, Double, Double)]
> euler stateInit sigma k1 b d k2 c n m =
>   iterateM (oneStepEuler (recip $ fromIntegral n) sigma k1 b d k2 c)
>            (return stateInit)
>            (n * m)
> eulerEx :: (Double, Double, Double) ->
>            Double -> Int -> Int -> Int ->
>            [(Double, Double, Double)]
> eulerEx stateInit sigma n m seed =
>   evalState (euler stateInit sigma k1 b d k2 c n m) (pureMT $ fromIntegral seed)

We see that the populations become noisier the further into the future we go.

Notice that the second order effects of the system are now to some extent captured by the fact that the growth rate of Hares can drift. In our simulation, this is demonstrated by our decreasing lack of knowledge the further we look into the future.

Inference

Now let us infer the growth rate using PMMH. Here’s the model expressed in LibBi.

// Infer growth rate for hares

model PP {
  const h         = 0.1;    // time step
  const delta_abs = 1.0e-3; // absolute error tolerance
  const delta_rel = 1.0e-6; // relative error tolerance

  const a  = 5.0e-1 // Hare growth rate - superfluous for inference
                    // but a reminder of what we should expect
  const k1 = 2.0e2  // Hare carrying capacity
  const b  = 2.0e-2 // Hare death rate per lynx
  const d  = 4.0e-1 // Lynx death rate
  const k2 = 2.0e1  // Lynx carrying capacity
  const c  = 4.0e-3 // Lynx birth rate per hare

  state P, Z       // Hares and lynxes
  state ln_alpha   // Hare growth rate - we express it in log form for
                   // consistency with the inference model
  obs P_obs        // Observations of hares
  param mu, sigma  // Mean and standard deviation of hare growth rate
  noise w          // Noise

  sub parameter {
    mu ~ uniform(0.0, 1.0)
    sigma ~ uniform(0.0, 0.5)
  }

  sub proposal_parameter {
     mu ~ truncated_gaussian(mu, 0.02, 0.0, 1.0);
     sigma ~ truncated_gaussian(sigma, 0.01, 0.0, 0.5);
   }

  sub initial {
    P ~ log_normal(log(100.0), 0.2)
    Z ~ log_normal(log(50.0), 0.1)
    ln_alpha ~ gaussian(log(mu), sigma)
  }

  sub transition(delta = h) {
    w ~ normal(0.0, sqrt(h));
    ode(h = h, atoler = delta_abs, rtoler = delta_rel, alg = 'RK4(3)') {
      dP/dt =  exp(ln_alpha) * P * (1 - P / k1) - b * P * Z
      dZ/dt = -d * Z * (1 + Z / k2) + c * P * Z
      dln_alpha/dt = -sigma * sigma / 2 - sigma * w / h
    }
  }

  sub observation {
    P_obs ~ log_normal(log(P), 0.1)
  }
}

Let’s look at the posteriors of the hyper-parameters for the Hare growth parameter.

The estimate for \mu is pretty decent. For our generated data, \sigma =0 and given our observations are quite noisy maybe the estimate for this is not too bad also.

Appendix: The R Driving Code

All code including the R below can be downloaded from github but make sure you use the straight-libbi branch and not master.

install.packages("devtools")
library(devtools)
install_github("sbfnk/RBi",ref="master")
install_github("sbfnk/RBi.helpers",ref="master")

rm(list = ls(all.names=TRUE))
unlink(".RData")

library('RBi')
try(detach(package:RBi, unload = TRUE), silent = TRUE)
library(RBi, quietly = TRUE)

library('RBi.helpers')

library('ggplot2', quietly = TRUE)
library('gridExtra', quietly = TRUE)

endTime <- 50

PP <- bi_model("PP.bi")
synthetic_dataset_PP <- bi_generate_dataset(endtime=endTime,
                                            model=PP,
                                            seed="42",
                                            verbose=TRUE,
                                            add_options = list(
                                                noutputs=500))

rdata_PP <- bi_read(synthetic_dataset_PP)

df <- data.frame(rdata_PP$P$nr,
                 rdata_PP$P$value,
                 rdata_PP$Z$value,
                 rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = Population, color = variable), size = 0.1) +
    geom_line(aes(y = rdata_PP$P$value, col = "Hare"), size = 0.1) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Lynx"), size = 0.1) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Observations"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Example Data") +
    xlab("Days") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/LVdata.png",width=4,height=3)

synthetic_dataset_PP1 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 100, Z=50),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP1 <- bi_read(synthetic_dataset_PP1)

synthetic_dataset_PP2 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 150, Z=25),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP2 <- bi_read(synthetic_dataset_PP2)

df1 <- data.frame(hare = rdata_PP$P$value,
                  lynx = rdata_PP$Z$value,
                  hare1 = rdata_PP1$P$value,
                  lynx1 = rdata_PP1$Z$value,
                  hare2 = rdata_PP2$P$value,
                  lynx2 = rdata_PP2$Z$value)

ggplot(df1) +
    geom_path(aes(x=df1$hare,  y=df1$lynx, col = "0"), size = 0.1) +
    geom_path(aes(x=df1$hare1, y=df1$lynx1, col = "1"), size = 0.1) +
    geom_path(aes(x=df1$hare2, y=df1$lynx2, col = "2"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Phase Space") +
    xlab("Hare") +
    ylab("Lynx") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/PPviaLibBi.png",width=4,height=3)

PPInfer <- bi_model("PPInfer.bi")

bi_object_PP <- libbi(client="sample", model=PPInfer, obs = synthetic_dataset_PP)

bi_object_PP$run(add_options = list(
                     "end-time" = endTime,
                     noutputs = endTime,
                     nsamples = 4000,
                     nparticles = 128,
                     seed=42,
                     nthreads = 1),
                 ## verbose = TRUE,
                 stdoutput_file_name = tempfile(pattern="pmmhoutput", fileext=".txt"))

bi_file_summary(bi_object_PP$result$output_file_name)

mu <- bi_read(bi_object_PP, "mu")$value
g1 <- qplot(x = mu[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(mu))
sigma <- bi_read(bi_object_PP, "sigma")$value
g2 <- qplot(x = sigma[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(sigma))
g3 <- grid.arrange(g1, g2)
ggsave(plot=g3,filename="diagrams/LvPosterior.png",width=4,height=3)


df2 <- data.frame(hareActs = rdata_PP$P$value,
                  hareObs  = rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = value, color = variable)) +
    geom_line(aes(y = rdata_PP$P$value, col = "Phyto")) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Zoo")) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Phyto Obs"))

ln_alpha <- bi_read(bi_object_PP, "ln_alpha")$value

P <- matrix(bi_read(bi_object_PP, "P")$value,nrow=51,byrow=TRUE)
Z <- matrix(bi_read(bi_object_PP, "Z")$value,nrow=51,byrow=TRUE)

data50 <- bi_generate_dataset(endtime=endTime,
                              model=PP,
                              seed="42",
                              verbose=TRUE,
                              add_options = list(
                                  noutputs=50))

rdata50 <- bi_read(data50)

df3 <- data.frame(days = c(1:51), hares = rowMeans(P), lynxes = rowMeans(Z),
                                  actHs = rdata50$P$value, actLs = rdata50$Z$value)


ggplot(df3) +
    geom_line(aes(x = days, y = hares, col = "Est Phyto")) +
    geom_line(aes(x = days, y = lynxes, col = "Est Zoo")) +
    geom_line(aes(x = days, y = actHs, col = "Act Phyto")) +
    geom_line(aes(x = days, y = actLs, col = "Act Zoo"))

Bibliography

Andrieu, Christophe, Arnaud Doucet, and Roman Holenstein. 2010. “Particle Markov chain Monte Carlo methods.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 72 (3): 269–342. doi:10.1111/j.1467-9868.2009.00736.x.

Dureau, Joseph, Konstantinos Kalogeropoulos, and Marc Baguelin. 2013. “Capturing the time-varying drivers of an epidemic using stochastic dynamical systems.” Biostatistics (Oxford, England) 14 (3): 541–55. doi:10.1093/biostatistics/kxs052.

Lotka, Alfred J. 1909. “Contribution to the Theory of Periodic Reactions.” The Journal of Physical Chemistry 14 (3): 271–74. doi:10.1021/j150111a004.

Murray, Lawrence M. n.d. “Bayesian State-Space Modelling on High-Performance Hardware Using LibBi.”

Volterra, Vito. 1926. “Variazioni e fluttuazioni del numero d’individui in specie animali conviventi.” Memorie Della R. Accademia Dei Lincei 6 (2): 31–113. http://www.liberliber.it/biblioteca/v/volterra/variazioni{\_}e{\_}fluttuazioni/pdf/volterra{\_}variazioni{\_}e{\_}fluttuazioni.pdf.

Naive Particle Smoothing is Degenerate

Introduction

Let \{X_t\}_{t \geq 1} be a (hidden) Markov process. By hidden, we mean that we are not able to observe it.

\displaystyle  X_1 \sim \mu(\centerdot) \quad X_t \,|\, (X_{t-1} = x) \sim f(\centerdot \,|\, x)

And let \{Y_t\}_{t \geq 1} be an observable Markov process such that

\displaystyle  Y_t \,|\, (X_{t} = x) \sim g(\centerdot \,|\, x)

That is the observations are conditionally independent given the state of the hidden process.

As an example let us take the one given in Särkkä (2013) where the movement of a car is given by Newton’s laws of motion and the acceleration is modelled as white noise.

\displaystyle  \begin{aligned} X_t &= AX_{t-1} + Q \epsilon_t \\ Y_t &= HX_t + R \eta_t \end{aligned}

Although we do not do so here, A, Q, H and R can be derived from the dynamics. For the purpose of this blog post, we note that they are given by

\displaystyle  A = \begin{bmatrix} 1 & 0 & \Delta t & 0        \\ 0 & 1 & 0        & \Delta t \\ 0 & 0 & 1        & 0        \\ 0 & 0 & 0        & 1 \end{bmatrix} , \quad Q = \begin{bmatrix} \frac{q_1^c \Delta t^3}{3} & 0                          & \frac{q_1^c \Delta t^2}{2} & 0                          \\ 0                          & \frac{q_2^c \Delta t^3}{3} & 0                          & \frac{q_2^c \Delta t^2}{2} \\ \frac{q_1^c \Delta t^2}{2} & 0                          & {q_1^c \Delta t}           & 0                          \\ 0                          & \frac{q_2^c \Delta t^2}{2} & 0                          & {q_2^c \Delta t} \end{bmatrix}

and

\displaystyle  H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} , \quad R = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}

We wish to determine the position and velocity of the car given noisy observations of the position. In general we need the distribution of the hidden path given the observable path. We use the notation x_{m:n} to mean the path of x starting a m and finishing at n.

\displaystyle  p(x_{1:n} \,|\, y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})}

Haskell Preamble

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE FlexibleInstances            #-}
> {-# LANGUAGE MultiParamTypeClasses        #-}
> {-# LANGUAGE FlexibleContexts             #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving   #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE TemplateHaskell              #-}
> module ParticleSmoothing
>   ( simpleSamples
>   , carSamples
>   , testCar
>   , testSimple
>   ) where
> import Data.Random.Source.PureMT
> import Data.Random hiding ( StdNormal, Normal )
> import qualified Data.Random as R
> import Control.Monad.State
> import Control.Monad.Writer hiding ( Any, All )
> import qualified Numeric.LinearAlgebra.HMatrix as H
> import Foreign.Storable ( Storable )
> import Data.Maybe ( fromJust )
> import Data.Bits ( shiftR )
> import qualified Data.Vector as V
> import qualified Data.Vector.Unboxed as U
> import Control.Monad.ST
> import System.Random.MWC
> import Data.Array.Repa ( Z(..), (:.)(..), Any(..), computeP,
>                          extent, DIM1, DIM2, slice, All(..)
>                        )
> import qualified Data.Array.Repa as Repa
> import qualified Control.Monad.Loops as ML
> import PrettyPrint ()
> import Text.PrettyPrint.HughesPJClass ( Pretty, pPrint )
> import Data.Vector.Unboxed.Deriving

Some Theory

If we could sample X_{1:n}^{(i)} \sim p(x_{1:n} \,|\, y_{1:n}) then we could approximate the posterior as

\displaystyle  \hat{p}(x_{1:n} \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n}^{(i)}}(x_{1:n})

If we wish to, we can create marginal estimates

\displaystyle  \hat{p}(x_k \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{k}^{(i)}}(x_{k})

When k = N, this is the filtering estimate.

Standard Bayesian Recursion

Prediction

\displaystyle  \begin{aligned} p(x_n \,|\, y_{1:n-1}) &= \int p(x_{n-1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\  &= \int p(x_{n} \,|\, x_{n-1}, y_{1:n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\  &= \int f(x_{n} \,|\, x_{n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ \end{aligned}

Update

\displaystyle  \begin{aligned} p(x_n \,|\, y_{1:n}) &= \frac{p(y_n \,|\, x_n, y_{1:n-1}) \, p(x_n \,|\, y_{1:n-1})}                              {p(y_n \,|\, y_{1:n-1})} \\                      &= \frac{g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})}                              {p(y_n \,|\, y_{1:n-1})} \end{aligned}

where by definition

\displaystyle  {p(y_n \,|\, y_{1:n-1})} = \int {g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})} \,\mathrm{d}x_n

Path Space Recursion

We have

\displaystyle  \begin{aligned} p(x_{1:n} \,|\, y_{1:n}) &= \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})} \\ &= \frac{p(x_n, y_n \,|\, x_{1:n-1}, y_{1:n-1})}{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{p(y_n \,|\, x_{1:n}, y_{1:n-1}) \, p(x_n \,|\, x_{1:n-1}, y_{1:n-1}) }{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, \frac{p(x_{1:n-1}, y_{1:n-1})}{ \, p(y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, {p(x_{1:n-1} \,|\,y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \overbrace{f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}}^{\mathrm{predictive}\,p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})} \\ \end{aligned}

where by definition

\displaystyle  p(y_n \,|\, y_{1:n-1}) = \int g(y_n \,|\, x_n) \, p(x_{1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{1:n}

Prediction

\displaystyle  p(x_{1:n} \,|\, y_{1:n-1}) = f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}

Update

\displaystyle  p(x_{1:n} \,|\, y_{1:n}) = \frac{g(y_n \,|\, x_{n}) \, {p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})}

Algorithm

The idea is to simulate paths using the recursion we derived above.

At time n-1 we have an approximating distribution

\displaystyle  \hat{p}(x_{1:n-1} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n-1}}^{(i)}(x_{1:n-1})

Sample \tilde{X}_n^{(i)} \sim f(\centerdot \,|\, X_{n-1}^{(i)}) and set \tilde{X}_{1:n}^{(i)} = (\tilde{X}_{1:n-1}^{(i)}, \tilde{X}_n^{(i)}). We then have an approximation of the prediction step

\displaystyle  \hat{p}(x_{1:n} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})

Substituting

\displaystyle  \begin{aligned} {\hat{p}(y_n \,|\, y_{1:n-1})} &= \int {g(y_n \,|\, x_n) \, \hat{p}(x_{1:n} \,|\, y_{1:n-1})} \,\mathrm{d}x_n \\ &= \int {g(y_n \,|\, x_n)}\frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n-1}}^{(i)}(x_{1:n}) \,\mathrm{d}x_n \\ &= \frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})} \end{aligned}

and again

\displaystyle  \begin{aligned} \tilde{p}(x_{1:n} \,|\, y_{1:n}) &= \frac{g(y_n \,|\, x_{n}) \, {\hat{p}(x_{1:n} \,|\, y_{1:n-1})}}      {\hat{p}(y_n \,|\, y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})}      {\frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \frac{ \sum_{i=1}^N g(y_n \,|\, \tilde{X}_n^{(i)}) \, \delta_{\tilde{X}_{1:n}}^{(i)}(x_{1:n})}      {\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \sum_{i=1}^N W_n^{(i)} \delta_{\tilde{X}_{1:n}^{(i)}} (x_{1:n}) \end{aligned}

where W_n^{(i)} \propto g(y_n \,|\, \tilde{X}_n^{(i)}) and \sum_{i=1}^N W_n^{(i)} = 1.

Now sample

\displaystyle  X_{1:n}^{(i)} \sim \tilde{p}(x_{1:n} \,|\, y_{1:n})

A Haskell Implementation

Let’s specify some values for the example of the car moving in two dimensions.

> deltaT, sigma1, sigma2, qc1, qc2 :: Double
> deltaT = 0.1
> sigma1 = 1/2
> sigma2 = 1/2
> qc1 = 1
> qc2 = 1
> bigA :: H.Matrix Double
> bigA = (4 H.>< 4) bigAl
> bigAl :: [Double]
> bigAl = [1, 0 , deltaT,      0,
>          0, 1,       0, deltaT,
>          0, 0,       1,      0,
>          0, 0,       0,      1]
> bigQ :: H.Herm Double
> bigQ = H.trustSym $ (4 H.>< 4) bigQl
> bigQl :: [Double]
> bigQl = [qc1 * deltaT^3 / 3,                  0, qc1 * deltaT^2 / 2,                  0,
>                           0, qc2 * deltaT^3 / 3,                  0, qc2 * deltaT^2 / 2,
>          qc1 * deltaT^2 / 2,                  0,       qc1 * deltaT,                  0,
>                           0, qc2 * deltaT^2 / 2,                  0,       qc2 * deltaT]
> bigH :: H.Matrix Double
> bigH = (2 H.>< 4) [1, 0, 0, 0,
>                    0, 1, 0, 0]
> bigR :: H.Herm Double
> bigR = H.trustSym $ (2 H.>< 2) [sigma1^2,        0,
>                                        0, sigma2^2]
> m0 :: H.Vector Double
> m0 = H.fromList [0, 0, 1, -1]
> bigP0 :: H.Herm Double
> bigP0 = H.trustSym $ H.ident 4
> n :: Int
> n = 23

With these we generate hidden and observable sample path.

> carSample :: MonadRandom m =>
>              H.Vector Double ->
>              m (Maybe ((H.Vector Double, H.Vector Double), H.Vector Double))
> carSample xPrev = do
>   xNew <- sample $ rvar (Normal (bigA H.#> xPrev) bigQ)
>   yNew <- sample $ rvar (Normal (bigH H.#> xNew) bigR)
>   return $ Just ((xNew, yNew), xNew)
> carSamples :: [(H.Vector Double, H.Vector Double)]
> carSamples = evalState (ML.unfoldrM carSample m0) (pureMT 17)

We can plot an example trajectory for the car and the noisy observations that are available to the smoother / filter.

Sadly there is no equivalent to numpy in Haskell. Random number packages generate vectors, for multi-rank arrays there is repa and for fast matrix manipulation there is hmtatrix. Thus for our single step path update function, we have to pass in functions to perform type conversion. Clearly with all the copying inherent in this approach, performance is not going to be great.

The type synonym ArraySmoothing is used to denote the cloud of particles at each time step.

> type ArraySmoothing = Repa.Array Repa.U DIM2
> singleStep :: forall a . U.Unbox a =>
>               (a -> H.Vector Double) ->
>               (H.Vector Double -> a) ->
>               H.Matrix Double ->
>               H.Herm Double ->
>               H.Matrix Double ->
>               H.Herm Double ->
>               ArraySmoothing a -> H.Vector Double ->
>               WriterT [ArraySmoothing a] (StateT PureMT IO) (ArraySmoothing a)
> singleStep f g bigA bigQ bigH bigR x y = do
>   tell[x]
>   let (Z :. ix :. jx) = extent x
> 
>   xHatR <- lift $ computeP $ Repa.slice x (Any :. jx - 1)
>   let xHatH = map f $ Repa.toList (xHatR  :: Repa.Array Repa.U DIM1 a)
>   xTildeNextH <- lift $ mapM (\x -> sample $ rvar (Normal (bigA H.#> x) bigQ)) xHatH
> 
>   let xTildeNextR = Repa.fromListUnboxed (Z :. ix :. (1 :: Int)) $
>                     map g xTildeNextH
>       xTilde = Repa.append x xTildeNextR
> 
>       weights = map (normalPdf y bigR) $
>                 map (bigH H.#>) xTildeNextH
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList weights)
>       totWeight = sum weights
>       js = indices (V.map (/ totWeight) $ V.tail cumSumWeights) vs
>       xNewV = V.map (\j -> Repa.transpose $
>                            Repa.reshape (Z :. (1 :: Int) :. jx + 1) $
>                            slice xTilde (Any :. j :. All)) js
>       xNewR = Repa.transpose $ V.foldr Repa.append (xNewV V.! 0) (V.tail xNewV)
>   computeP xNewR

The state for the car is a 4-tuple.

> data SystemState a = SystemState { xPos  :: a
>                                  , yPos  :: a
>                                  , _xSpd :: a
>                                  , _ySpd :: a
>                                  }

We initialize the smoother from some prior distribution.

> initCar :: StateT PureMT IO (ArraySmoothing (SystemState Double))
> initCar = do
>   xTilde1 <- replicateM n $ sample $ rvar (Normal m0 bigP0)
>   let weights = map (normalPdf (snd $ head carSamples) bigR) $
>                 map (bigH H.#>) xTilde1
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $
>               map ((\[a,b,c,d] -> SystemState a b c d) . H.toList) $
>               V.toList $
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1

Now we can run the smoother.

> smootherCar :: StateT PureMT IO
>             (ArraySmoothing (SystemState Double)
>             , [ArraySmoothing (SystemState Double)])
> smootherCar = runWriterT $ do
>   xHat1 <- lift initCar
>   foldM (singleStep f g bigA bigQ bigH bigR) xHat1 (take 100 $ map snd $ tail carSamples)
> f :: SystemState Double -> H.Vector Double
> f (SystemState a b c d) = H.fromList [a, b, c, d]
> g :: H.Vector Double -> SystemState Double
> g = (\[a,b,c,d] -> (SystemState a b c d)) . H.toList

And create inferred positions for the car which we then plot.

> testCar :: IO ([Double], [Double])
> testCar = do
>   states <- snd <$> evalStateT smootherCar (pureMT 24)
>   let xs :: [Repa.Array Repa.D DIM2 Double]
>       xs = map (Repa.map xPos) states
>   sumXs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose xs)
>   let ixs = map extent sumXs
>       sumLastXs = map (* (recip $ fromIntegral n)) $
>                   zipWith (Repa.!) sumXs (map (\(Z :. x) -> Z :. (x - 1)) ixs)
>   let ys :: [Repa.Array Repa.D DIM2 Double]
>       ys = map (Repa.map yPos) states
>   sumYs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose ys)
>   let ixsY = map extent sumYs
>       sumLastYs = map (* (recip $ fromIntegral n)) $
>                   zipWith (Repa.!) sumYs (map (\(Z :. x) -> Z :. (x - 1)) ixsY)
>   return (sumLastXs, sumLastYs)

So it seems our smoother does quite well at inferring the state at the latest observation, that is, when it is working as a filter. But what about estimates for earlier times? We should do better as we have observations in the past and in the future. Let’s try with a simpler example and a smaller number of particles.

First we create some samples for our simple 1 dimensional linear Gaussian model.

> bigA1, bigQ1, bigR1, bigH1 :: Double
> bigA1 = 0.5
> bigQ1 = 0.1
> bigR1 = 0.1
> bigH1 = 1.0
> simpleSample :: MonadRandom m =>
>               Double ->
>               m (Maybe ((Double, Double), Double))
> simpleSample xPrev = do
>   xNew <- sample $ rvar (R.Normal (bigA1 * xPrev) bigQ1)
>   yNew <- sample $ rvar (R.Normal (bigH1 * xNew) bigR1)
>   return $ Just ((xNew, yNew), xNew)
> simpleSamples :: [(Double, Double)]
> simpleSamples = evalState (ML.unfoldrM simpleSample 0.0) (pureMT 17)

Again create a prior.

> initSimple :: MonadRandom m => m (ArraySmoothing Double)
> initSimple = do
>   let y = snd $ head simpleSamples
>   xTilde1 <- replicateM n $ sample $ rvar $ R.Normal y bigR1
>   let weights = map (pdf (R.Normal y bigR1)) $
>                 map (bigH1 *) xTilde1
>       totWeight = sum weights
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList $ map (/ totWeight) weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $
>               V.toList $
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1

Now we can run the smoother.

> smootherSimple :: StateT PureMT IO (ArraySmoothing Double, [ArraySmoothing Double])
> smootherSimple = runWriterT $ do
>   xHat1 <- lift initSimple
>   foldM (singleStep f1 g1 ((1 H.>< 1) [bigA1]) (H.trustSym $ (1 H.>< 1) [bigQ1^2])
>                           ((1 H.>< 1) [bigH1]) (H.trustSym $ (1 H.>< 1) [bigR1^2]))
>         xHat1
>         (take 20 $ map H.fromList $ map return . map snd $ tail simpleSamples)
> f1 :: Double -> H.Vector Double
> f1 a = H.fromList [a]
> g1 :: H.Vector Double -> Double
> g1 = (\[a] -> a) . H.toList

And finally we can look at the paths not just the means of the marginal distributions at the latest observation time.

> testSimple :: IO [[Double]]
> testSimple = do
>   states <- snd <$> evalStateT smootherSimple (pureMT 24)
>   let path :: Int -> IO (Repa.Array Repa.U DIM1 Double)
>       path i = computeP $ Repa.slice (last states) (Any :. i :. All)
>   paths <- mapM path [0..n - 1]
>   return $ map Repa.toList paths

We can see that at some point in the past all the current particles have one ancestor. The marginals of the smoothing distribution (at some point in the past) have collapsed on to one particle.

Notes

Helpers for the Inverse CDF

That these are helpers for the inverse CDF is delayed to another blog post.

> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs
> binarySearch :: Ord a =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1

Multivariate Normal

The random-fu package does not contain a sampler or pdf for a multivariate normal so we create our own. This should be added to random-fu-multivariate package or something similar.

> normalMultivariate :: H.Vector Double -> H.Herm Double -> RVarT m (H.Vector Double)
> normalMultivariate mu bigSigma = do
>   z <- replicateM (H.size mu) (rvarT R.StdNormal)
>   return $ mu + bigA H.#> (H.fromList z)
>   where
>     (vals, bigU) = H.eigSH bigSigma
>     lSqrt = H.diag $ H.cmap sqrt vals
>     bigA = bigU H.<> lSqrt
> data family Normal k :: *
> data instance Normal (H.Vector Double) = Normal (H.Vector Double) (H.Herm Double)
> instance Distribution Normal (H.Vector Double) where
>   rvar (Normal m s) = normalMultivariate m s
> normalPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
>              H.Vector a -> H.Herm a -> H.Vector a -> a
> normalPdf mu sigma x = exp $ normalLogPdf mu sigma x
> normalLogPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
>                  H.Vector a -> H.Herm a -> H.Vector a -> a
> normalLogPdf mu bigSigma x = - H.sumElements (H.cmap log (diagonals dec))
>                               - 0.5 * (fromIntegral (H.size mu)) * log (2 * pi)
>                               - 0.5 * s
>   where
>     dec = fromJust $ H.mbChol bigSigma
>     t = fromJust $ H.linearSolve (H.tr dec) (H.asColumn $ x - mu)
>     u = H.cmap (\x -> x * x) t
>     s = H.sumElements u
> diagonals :: (Storable a, H.Element t, H.Indexable (H.Vector t) a) =>
>              H.Matrix t -> H.Vector a
> diagonals m = H.fromList (map (\i -> m H.! i H.! i) [0..n-1])
>   where
>     n = max (H.rows m) (H.cols m)
> instance PDF Normal (H.Vector Double) where
>   pdf (Normal m s) = normalPdf m s
>   logPdf (Normal m s) = normalLogPdf m s

Misellaneous

> derivingUnbox "SystemState"
>     [t| forall a . (U.Unbox a) => SystemState a -> (a, a, a, a) |]
>     [| \ (SystemState x y xdot ydot) -> (x, y, xdot, ydot) |]
>     [| \ (x, y, xdot, ydot) -> SystemState x y xdot ydot |]
> instance Pretty a => Pretty (SystemState a) where
>   pPrint (SystemState x y xdot ydot ) = pPrint (x, y, xdot, ydot)

Bibliography

Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press.

Stochastic Integration

Introduction

Suppose we wish to model a process described by a differential equation and initial condition

\displaystyle   \begin{aligned}  \dot{x}(t) &= a(x, t) \\  x(0) &= a_0  \end{aligned}

But we wish to do this in the presence of noise. It’s not clear how do to this but maybe we can model the process discretely, add noise and somehow take limits.

Let \pi = \{0 = t_0 \leq t_1 \leq \ldots \leq t_n = t\} be a partition of [0, t] then we can discretise the above, allow the state to be random and add in some noise which we model as samples of Brownian motion at the selected times multiplied by b so that we can vary the amount noise depending on the state. We change the notation from x to X(\omega) to indicate that the variable is now random over some probability space.

\displaystyle   \begin{aligned}  {X}(t_{i+1}, \omega) - {X}(t_i, \omega)  &= a({X}(t_i, \omega))(t_{i+1} - t_i) +                                              b({X}(t_i, \omega))(W(t_{i+1}, \omega) - W(t_i, \omega)) \\  X(t_0, \omega) &= A_{0}(\omega)  \end{aligned}

We can suppress explicit mention of \omega and use subscripts to avoid clutter.

\displaystyle   \begin{aligned}  {X}_{t_{i+1}} - {X}_{t_i}  &= a({X}_{t_i})(t_{i+1} - t_i) +                                b({X}_{t_i})(W_{t_{i+1}} - W_{t_i}) \\  X(t_0) &= A_{0}(\omega)  \end{aligned}

We can make this depend continuously on time specifying that

\displaystyle   X_t = X_{t_i} \quad \mathrm{for} \, t \in (t_i, t_{i+1}]

and then telescoping to obtain

\displaystyle   \begin{aligned}  {X}_{t} &= X_{t_0} + \sum_{i=0}^{k-1} a({X}_{t_i})(t_{i+1} - t_i) +                       \sum_{i=0}^{k-1} b({X}_{t_i})(W_{t_{i+1}} - W_{t_i})                       \quad \mathrm{for} \, t \in (t_k, t_{k+1}]  \end{aligned}

In the limit, the second term on the right looks like an ordinary integral with respect to time albeit the integrand is stochastic but what are we to make of the the third term? We know that Brownian motion is nowhere differentiable so it would seem the task is impossible. However, let us see what progress we can make with so-called simple proceses.

Simple Processes

Let

\displaystyle   X(t,\omega) = \sum_{i=0}^{k-1} B_i(\omega)\mathbb{I}_{(t_i, t_{i+1}]}(t)

where B_i is {\cal{F}}(t_i)-measurable. We call such a process simple. We can then define

\displaystyle   \int_0^\infty X_s \mathrm{d}W_s \triangleq \sum_{i=0}^{k-1} B_i{(W_{t_{i+1}} - W_{t_{i+1}})}

So if we can produce a sequence of simple processes, X_n that converge in some norm to X then we can define

\displaystyle   \int_0^\infty X(s)\mathrm{d}W(s) \triangleq \lim_{n \to \infty}\int_0^\infty X_n(s)\mathrm{d}W(s)

Of course we need to put some conditions of the particular class of stochastic processes for which this is possible and check that the limit exists and is unique.

We consider the {\cal{L}}^2(\mu \times \mathbb{P}), the space of square integrable functions with respect to the product measure \mu \otimes \mathbb{P} where \mu is Lesbegue measure on {\mathbb{R}^+} and \mathbb{P} is some given probability measure. We further restrict ourselves to progressively measurable functions. More explicitly, we consider the latter class of stochastic processes such that

\displaystyle   \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s < \infty

Less Simple Processes

Bounded, Almost Surely Continuous and Progressively Adapted

Let X be a bounded, almost surely continuous and progressively measurable process which is (almost surely) 0 for t > T for some positive constant T. Define

\displaystyle   X_n(t, \omega) \triangleq X\bigg(T\frac{i}{n}, \omega\bigg) \quad \mathrm{for} \quad T\frac{i}{n} \leq t < T\frac{i + 1}{n}

These processes are cleary progressively measurable and by bounded convergence (X is bounded by hypothesis and \{X_n\}_{n=0,\ldots} is uniformly bounded by the same bound).

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

Bounded and Progressively Measurable

Let X be a bounded and progressively measurable process which is (almost surely) 0 for t > T for some positive constant T. Define

\displaystyle   X_n(t, \omega) \triangleq \frac{1}{1/n}\int_{t-1/n}^t X(s, \omega) \,\mathrm{d}s

Then X^n(s, \omega) is bounded, continuous and progressively measurable and it is well known that X^n(t, \omega) \rightarrow X(t, \omega) as n \rightarrow 0. Again by bounded convergence

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

Progressively Measurable

Firstly, let X be a progressively measurable process which is (almost surely) 0 for t > T for some positive constant T. Define X_n(t, \omega) = X(t, \omega) \land n. Then X_n is bounded and by dominated convergence

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

Finally let X be a progressively measurable process. Define

\displaystyle   X_n(t, \omega) \triangleq  \begin{cases}  X(t, \omega) & \text{if } t \leq n \\  0            & \text{if } \mathrm{otherwise}  \end{cases}

Clearly

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

The Itô Isometry

Let X be a simple process such that

\displaystyle   \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s < \infty

then

\displaystyle   \mathbb{E}\bigg(\int_0^\infty X_s\,\mathrm{d}W_s\bigg)^2 =  \mathbb{E}\bigg(\sum_{i=0}^{k-1} B_i{(W_{t_{i+1}} - W_{t_{i}})}\bigg)^2 =  \sum_{i=0}^{k-1} \mathbb{E}(B_i)^2({t_{i+1}} - {t_{i}}) =  \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s

Now suppose that \{H_n\}_{n \in \mathbb{N}} is a Cauchy sequence of progressively measurable simple functions in {\cal{L}}^2(\mu \times \mathbb{P}) then since the difference of two simple processes is again a simple process we can apply the Itô Isometry to deduce that

\displaystyle   \lim_{m,n \to \infty}\mathbb{E}\bigg(\int_0^\infty (H_n(s) - H_m(s))\,\mathrm{d}W(s)\bigg)^2 =  \lim_{m,n \to \infty}\mathbb{E}\int_0^\infty (H_n(s) - H_m(s))^2\,\mathrm{d}s =  0

In other words, \int_0^\infty H_n(s)\,\mathrm{d}W(s) is also Cauchy in {\cal{L}}^2(\mathbb{P}) and since this is complete, we can conclude that

\displaystyle   \int_0^\infty X(s)\mathrm{d}W(s) \triangleq \lim_{n \to \infty}\int_0^\infty X_n(s)\mathrm{d}W(s)

exists (in {\cal{L}}^2(\mathbb{P})). Uniqueness follows using the triangle inequality and the Itô isometry.

Notes

  1. We defer proving the definition also makes sense almost surely to another blog post.

  2. This approach seems fairly standard see for example Handel (2007) and Mörters et al. (2010).

  3. Rogers and Williams (2000) takes a more general approach.

  4. Protter (2004) takes a different approach by defining stochastic processes which are good integrators, a more abstract motivation than the one we give here.

  5. The requirement of progressive measurability can be relaxed.

Bibliography

Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”

Mörters, P, Y Peres, O Schramm, and W Werner. 2010. Brownian motion. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=e-TbA-dSrzYC.

Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.

Rogers, L.C.G., and D. Williams. 2000. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus. Cambridge Mathematical Library. Cambridge University Press. https://books.google.co.uk/books?id=bDQy-zoHWfcC.

Rejection Sampling

Introduction

Suppose you want to sample from the truncated normal distribution. One way to do this is to use rejection sampling. But if you do this naïvely then you will run into performance problems. The excellent Devroye (1986) who references Marsaglia (1964) gives an efficient rejection sampling scheme using the Rayleigh distribution. The random-fu package uses the Exponential distribution.

Performance

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE FlexibleContexts             #-}
> import Control.Monad
> import Data.Random
> import qualified Data.Random.Distribution.Normal as N
> import Data.Random.Source.PureMT
> import Control.Monad.State

Here’s the naïve implementation.

> naiveReject :: Double -> RVar Double
> naiveReject x = doit
>   where
>     doit = do
>       y <- N.stdNormal
>       if y < x
>         then doit
>         else return y

And here’s an implementation using random-fu.

> expReject :: Double -> RVar Double
> expReject x = N.normalTail x

Let’s try running both of them

> n :: Int
> n = 10000000
> lower :: Double
> lower = 2.0
> testExp :: [Double]
> testExp = evalState (replicateM n $ sample (expReject lower)) (pureMT 3)
> testNaive :: [Double]
> testNaive = evalState (replicateM n $ sample (naiveReject lower)) (pureMT 3)
> main :: IO ()
> main = do
>   print $ sum testExp
>   print $ sum testNaive

Let’s try building and running both the naïve and better tuned versions.

ghc -O2 CompareRejects.hs

As we can see below we get 59.98s and 4.28s, a compelling reason to use the tuned version. And the difference in performance will get worse the less of the tail we wish to sample from.

Tuned

2.3731610476911187e7
  11,934,195,432 bytes allocated in the heap
       5,257,328 bytes copied during GC
          44,312 bytes maximum residency (2 sample(s))
          21,224 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23145 colls,     0 par    0.09s    0.11s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    4.19s  (  4.26s elapsed)
  GC      time    0.09s  (  0.11s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.28s  (  4.37s elapsed)

  %GC     time       2.2%  (2.6% elapsed)

  Alloc rate    2,851,397,967 bytes per MUT second

  Productivity  97.8% of total user, 95.7% of total elapsed

Naïve

2.3732073159369867e7
 260,450,762,656 bytes allocated in the heap
     111,891,960 bytes copied during GC
          85,536 bytes maximum residency (2 sample(s))
          76,112 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     512768 colls,     0 par    1.86s    2.24s     0.0000s    0.0008s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   58.12s  ( 58.99s elapsed)
  GC      time    1.86s  (  2.24s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    59.98s  ( 61.23s elapsed)

  %GC     time       3.1%  (3.7% elapsed)

  Alloc rate    4,481,408,869 bytes per MUT second

  Productivity  96.9% of total user, 94.9% of total elapsed

Bibliography

Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag. http://books.google.co.uk/books?id=mEw\_AQAAIAAJ.

Marsaglia, G. 1964. “Generating a Variable from the Tail of the Normal Distribution.” J-Technometrics 6 (1): 101–2.

Stopping Times

Let S an T be stopping times and let the filtration on which they are defined be right continuous. Then

  1. S \land T = \min(S, T),
  2. S \lor T = \max(S, T),
  3. S + T and
  4. \alpha T

are stopping times where \alpha > 1.

For the first we have \{S \land T \leq t\} = \{S \leq t\} \cup \{T \leq t\} and both the latter are in {\mathcal{F}_t} by the definition of a stopping time.

Similarly for the second \{S \lor T \leq t\} = \{S \leq t\} \cap \{T \leq t\}.

For the fourth we have \{\alpha T \leq t\} = \{T \leq \frac{t}{\alpha}\} \in {\mathcal{F}}_{t / \alpha} \subset {\mathcal{F}}_{t} since \alpha > 1.

The third is slightly trickier. For \omega \in \Omega, S(\omega) +T(\omega) < t if and only if for some rational q, we have S(\omega) + T(\omega) < q < t. We can thus we can find r \in \mathbb{Q} such that S(\omega) < r < q - T(\omega). Writing s \triangleq q - r \in \mathbb{Q} we also have T(\omega) < q - r = s. Thus we have S(\omega) + T(\omega) < t if and only if there exist r, s \in \mathbb{Q} and r, s > 0 such that r + s < t and S(\omega) < r and T(\omega) < s. In other words

\displaystyle   \{S +T < t\} = \bigcup_{r, s \in \mathbb{Q}^+} (\{S < r\} \cap \{T < s\})

By right continuity (Protter 2004 Theorem 1) of the filtration, we know the terms on the right hand side are in {\mathcal{F}}_r \subset {\mathcal{F}}_t and {\mathcal{F}}_s \subset {\mathcal{F}}_t so that the whole right hand side is in {\mathcal{F}}_t. We thus know that the left hand side is in {\mathcal{F}}_t and using right continuity again that therefore S + T must be a stopping time.

Bibliography

Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.

Fun with (Extended Kalman) Filters

Summary

An extended Kalman filter in Haskell using type level literals and automatic differentiation to provide some guarantees of correctness.

Population Growth

Suppose we wish to model population growth of bees via the logistic equation

\displaystyle  \begin{aligned}  \dot{p} & = rp\Big(1 - \frac{p}{k}\Big)  \end{aligned}

We assume the growth rate r is unknown and drawn from a normal distribution {\cal{N}}(\mu_r, \sigma_r^2) but the carrying capacity k is known and we wish to estimate the growth rate by observing noisy values y_i of the population at discrete times t_0 = 0, t_1 = \Delta T, t_2 = 2\Delta T, \ldots. Note that p_t is entirely deterministic and its stochasticity is only as a result of the fact that the unknown parameter of the logistic equation is sampled from a normal distribution (we could for example be observing different colonies of bees and we know from the literature that bee populations obey the logistic equation and each colony will have different growth rates).

Haskell Preamble

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE TypeFamilies                 #-}
> module FunWithKalman3 where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import Numeric.AD
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import Control.Monad.Loops

Logistic Equation

The logistic equation is a well known example of a dynamical system which has an analytic solution

\displaystyle  p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}

Here it is in Haskell

> logit :: Floating a => a -> a -> a -> a
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))

We observe a noisy value of population at regular time intervals (where \Delta T is the time interval)

\displaystyle  \begin{aligned}  p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\  y_i &= p_i + \epsilon_i  \end{aligned}

Using the semi-group property of our dynamical system, we can re-write this as

\displaystyle  \begin{aligned}  p_i &= \frac{kp_{i-1}\exp r\Delta T}{k + p_{i-1}(\exp r\Delta T - 1)} \\  y_i &= p_i + \epsilon_i  \end{aligned}

To convince yourself that this re-formulation is correct, think of the population as starting at p_0; after 1 time step it has reached p_1 and and after two time steps it has reached p_2 and this ought to be the same as the point reached after 1 time step starting at p_1, for example

> oneStepFrom0, twoStepsFrom0, oneStepFrom1 :: Double
> oneStepFrom0  = logit 0.1 1.0 (1 * 0.1)
> twoStepsFrom0 = logit 0.1 1.0 (1 * 0.2)
> oneStepFrom1  = logit oneStepFrom0 1.0 (1 * 0.1)
ghci> twoStepsFrom0
  0.11949463171139338

ghci> oneStepFrom1
  0.1194946317113934

We would like to infer the growth rate not just be able to predict the population so we need to add another variable to our model.

\displaystyle  \begin{aligned}  r_i &= r_{i-1} \\  p_i &= \frac{kp_{i-1}\exp r_{i-1}\Delta T}{k + p_{i-1}(\exp r_{i-1}\Delta T - 1)} \\  y_i &= \begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}r_i \\ p_i\end{bmatrix} + \begin{bmatrix}0 \\ \epsilon_i\end{bmatrix}  \end{aligned}

Extended Kalman

This is almost in the form suitable for estimation using a Kalman filter but the dependency of the state on the previous state is non-linear. We can modify the Kalman filter to create the extended Kalman filter (EKF) by making a linear approximation.

Since the measurement update is trivially linear (even in this more general form), the measurement update step remains unchanged.

\displaystyle  \begin{aligned}  \boldsymbol{v}_i & \triangleq  \boldsymbol{y}_i - \boldsymbol{g}(\hat{\boldsymbol{x}}^\flat_i) \\  \boldsymbol{S}_i & \triangleq  \boldsymbol{G}_i \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{G}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{G}_i^\top\boldsymbol{S}^{-1}_i \\  \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i  \end{aligned}

By Taylor we have

\displaystyle  \boldsymbol{a}(\boldsymbol{x}) \approx \boldsymbol{a}(\boldsymbol{m}) + \boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m})\delta\boldsymbol{x}

where \boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m}) is the Jacobian of \boldsymbol{a} evaluated at \boldsymbol{m} (for the exposition of the extended filter we take \boldsymbol{a} to be vector valued hence the use of a bold font). We take \delta\boldsymbol{x} to be normally distributed with mean of 0 and ignore any difficulties there may be with using Taylor with stochastic variables.

Applying this at \boldsymbol{m} = \hat{\boldsymbol{x}}_{i-1} we have

\displaystyle  \boldsymbol{x}_i = \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1})(\boldsymbol{x}_{i-1} - \hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{\epsilon}_i

Using the same reasoning as we did as for Kalman filters and writing \boldsymbol{A}_{i-1} for \boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1}) we obtain

\displaystyle  \begin{aligned}  \hat{\boldsymbol{x}}^\flat_i &=  \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) \\  \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1}  \hat{\boldsymbol{\Sigma}}_{i-1}  \boldsymbol{A}_{i-1}^\top  + \boldsymbol{\Sigma}^{(x)}_{i-1}  \end{aligned}

Haskell Implementation

Note that we pass in the Jacobian of the update function as a function itself in the case of the extended Kalman filter rather than the matrix representing the linear function as we do in the case of the classical Kalman filter.

> k, p0 :: Floating a => a
> k = 1.0
> p0 = 0.1 * k
> r, deltaT :: Floating a => a
> r = 10.0
> deltaT = 0.0005

Relating ad and hmatrix is somewhat unpleasant but this can probably be ameliorated by defining a suitable datatype.

> a :: R 2 -> R 2
> a rpPrev = rNew # pNew
>   where
>     (r, pPrev) = headTail rpPrev
>     rNew :: R 1
>     rNew = konst r
> 
>     (p,  _) = headTail pPrev
>     pNew :: R 1
>     pNew = fromList $ [logit p k (r * deltaT)]
> bigA :: R 2 -> Sq 2
> bigA rp = fromList $ concat $ j [r, p]
>   where
>     (r, ps) = headTail rp
>     (p,  _) = headTail ps
>     j = jacobian (\[r, p] -> [r, logit p k (r * deltaT)])

For some reason, hmatrix with static guarantees does not yet provide an inverse function for matrices.

> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $ linSolve m eye

Here is the extended Kalman filter itself. The type signatures on the expressions inside the function are not necessary but did help the implementor discover a bug in the mathematical derivation and will hopefully help the reader.

> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n ->
>           L m n -> Sq m ->
>           (R n -> R n) -> (R n -> Sq n) -> Sq n ->
>           [R m] ->
>           [(R n, Sq n)]
> outer muPrior sigmaPrior bigH bigSigmaY
>       littleA bigABuilder bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
> 
>     update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
> 
>         v :: R m
>         v = y - (bigH #> xHatFlat)
> 
>         bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
> 
>         bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
> 
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
> 
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
> 
>         bigA :: Sq n
>         bigA = bigABuilder xHat
> 
>         xHatFlatNew :: R n
>         xHatFlatNew = littleA xHat
> 
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

Now let us create some sample data.

> obsVariance :: Double
> obsVariance = 1e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [obsVariance]
> nObs :: Int
> nObs = 300
> singleSample :: Double -> RVarT (W.Writer [Double]) Double
> singleSample p0 = do
>   epsilon <- rvarT (Normal 0.0 obsVariance)
>   let p1 = logit p0 k (r * deltaT)
>   lift $ W.tell [p1 + epsilon]
>   return p1
> streamSample :: RVarT (W.Writer [Double]) Double
> streamSample = iterateM_ singleSample p0
> samples :: [Double]
> samples = take nObs $ snd $
>           W.runWriter (evalStateT (sample streamSample) (pureMT 3))

We created our data with a growth rate of

ghci> r
  10.0

but let us pretend that we have read the literature on growth rates of bee colonies and we have some big doubts about growth rates but we are almost certain about the size of the colony at t=0.

> muPrior :: R 2
> muPrior = fromList [5.0, 0.1]
> 
> sigmaPrior :: Sq 2
> sigmaPrior = fromList [ 1e2, 0.0
>                       , 0.0, 1e-10
>                       ]

We only observe the population and not the rate itself.

> bigH :: L 1 2
> bigH = fromList [0.0, 1.0]

Strictly speaking this should be 0 but this is close enough.

> bigSigmaX :: Sq 2
> bigSigmaX = fromList [ 1e-10, 0.0
>                      , 0.0, 1e-10
>                      ]

Now we can run our filter and watch it switch away from our prior belief as it accumulates more and more evidence.

> test :: [(R 2, Sq 2)]
> test = outer muPrior sigmaPrior bigH bigSigmaY
>        a bigA bigSigmaX (map (fromList . return) samples)

Haskell Vectors and Sampling from a Categorical Distribution

Introduction

Suppose we have a vector of weights which sum to 1.0 and we wish to sample n samples randomly according to these weights. There is a well known trick in Matlab / Octave using sampling from a uniform distribution.

num_particles = 2*10^7
likelihood = zeros(num_particles,1);
likelihood(:,1) = 1/num_particles;
[_,index] = histc(rand(num_particles,1),[0;cumsum(likelihood/sum(likelihood))]);
s = sum(index);

Using tic and toc this produces an answer with

Elapsed time is 10.7763 seconds.

Haskell

I could find no equivalent function in Haskell nor could I easily find a binary search function.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE BangPatterns                 #-}
> import System.Random.MWC
> import qualified Data.Vector.Unboxed as V
> import Control.Monad.ST
> import qualified Data.Vector.Algorithms.Search as S
> import Data.Bits
> n :: Int
> n = 2*10^7

Let’s create some random data. For a change let’s use mwc-random rather than random-fu.

> vs  :: V.Vector Double
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))

Again, I could find no equivalent of cumsum but we can write our own.

> weightsV, cumSumWeightsV :: V.Vector Double
> weightsV = V.replicate n (recip $ fromIntegral n)
> cumSumWeightsV = V.scanl (+) 0 weightsV

Binary search on a sorted vector is straightforward and a cumulative sum ensures that the vector is sorted.

> binarySearch :: (V.Unbox a, Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs

To see how well this performs, let’s sum the indices (of course, we wouldn’t do this in practice) as we did for the Matlab implementation.

> js :: V.Vector Int
> js = indices (V.tail cumSumWeightsV) vs
> main :: IO ()
> main = do
>   print $ V.foldl' (+) 0 js

Using +RTS -s we get

Total   time   10.80s  ( 11.06s elapsed)

which is almost the same as the Matlab version.

I did eventually find a binary search function in vector-algorithms and since one should not re-invent the wheel, let us try using it.

> indices' :: (V.Unbox a, Ord a) => V.Vector a -> V.Vector a -> V.Vector Int
> indices' sv x = runST $ do
>   st <- V.unsafeThaw (V.tail sv)
>   V.mapM (S.binarySearch st) x
> main' :: IO ()
> main' = do
>   print $  V.foldl' (+) 0 $ indices' cumSumWeightsV vs

Again using +RTS -s we get

Total   time   11.34s  ( 11.73s elapsed)

So the library version seems very slightly slower.

Fun with (Kalman) Filters Part II

Introduction

Suppose we have particle moving in at constant velocity in 1 dimension, where the velocity is sampled from a distribution. We can observe the position of the particle at fixed intervals and we wish to estimate its initial velocity. For generality, let us assume that the positions and the velocities can be perturbed at each interval and that our measurements are noisy.

A point of Haskell interest: using type level literals caught a bug in the mathematical description (one of the dimensions of a matrix was incorrect). Of course, this would have become apparent at run-time but proof checking of this nature is surely the future for mathematicians. One could conceive of writing an implementation of an algorithm or proof, compiling it but never actually running it purely to check that some aspects of the algorithm or proof are correct.

The Mathematical Model

We take the position as x_i and the velocity v_i:

\displaystyle  \begin{aligned}  x_i &= x_{i-1} + \Delta T v_{i-1} + \psi^{(x)}_i \\  v_i &= v_{i-1} + \psi^{(v)}_i \\  y_i &= a_i x_i + \upsilon_i  \end{aligned}

where \psi^{(x)}_i, \psi^{(v)}_i and \upsilon_i are all IID normal with means of 0 and variances of \sigma^2_x, \sigma^2_v and \sigma^2_y

We can re-write this as

\displaystyle  \begin{aligned}  \boldsymbol{x}_i &= \boldsymbol{A}_{i-1}\boldsymbol{x}_{i-1} + \boldsymbol{\psi}_{i-1} \\  \boldsymbol{y}_i &= \boldsymbol{H}_i\boldsymbol{x}_i + \boldsymbol{\upsilon}_i  \end{aligned}

where

\displaystyle  \boldsymbol{A}_i =  \begin{bmatrix}  1 & \Delta T\\  0 & 1\\  \end{bmatrix}  ,\quad  \boldsymbol{H}_i =  \begin{bmatrix}  a_i & 0 \\  \end{bmatrix}  ,\quad  \boldsymbol{\psi}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(x)}_i\big)  ,\quad  \boldsymbol{\Sigma}^{(x)}_i =  \begin{bmatrix}  \sigma^2_{x} & 0\\  0 & \sigma^2_{v} \\  \end{bmatrix}  ,\quad  \boldsymbol{\upsilon}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(y)}_i\big)  ,\quad  \boldsymbol{\Sigma}^{(y)}_i =  \begin{bmatrix}  \sigma^2_{z} \\  \end{bmatrix}

Let us denote the mean and variance of \boldsymbol{X}_i\,\vert\,\boldsymbol{Y}_{i-1} as \hat{\boldsymbol{x}}^\flat_i and \hat{\boldsymbol{\Sigma}}^\flat_i respectively and note that

\displaystyle  \begin{aligned}  {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}  \end{aligned}

Since {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} and {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} are jointly Gaussian and recalling that ({\hat{\boldsymbol{\Sigma}}^\flat_i})^\top = \hat{\boldsymbol{\Sigma}}^\flat_i as covariance matrices are symmetric, we can calculate their mean and covariance matrix as

\displaystyle  \begin{bmatrix}  \hat{\boldsymbol{x}}^\flat_i \\  \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i  \end{bmatrix}  ,\quad  \begin{bmatrix}  \hat{\boldsymbol{\Sigma}}^\flat_i & \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top \\  \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i & \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \end{bmatrix}

We can now use standard formulæ which say if

\displaystyle  \begin{bmatrix}  \boldsymbol{X} \\  \boldsymbol{Y}  \end{bmatrix}  \sim  {\cal{N}}  \begin{bmatrix}  \begin{bmatrix}  \boldsymbol{\mu}_x \\  \boldsymbol{\mu}_y  \end{bmatrix}  &  ,  &  \begin{bmatrix}  \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\  \boldsymbol{\Sigma}^\top_{xy} & \boldsymbol{\Sigma}_y  \end{bmatrix}  \end{bmatrix}

then

\displaystyle  \boldsymbol{X}\,\vert\,\boldsymbol{Y}=\boldsymbol{y} \sim {{\cal{N}}\big( \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y(\boldsymbol{y} - \boldsymbol{\mu}_y) , \boldsymbol{\Sigma}_x - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y\boldsymbol{\Sigma}^\top_{xy}\big)}

and apply this to

\displaystyle  (\boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i-1})\,\vert\,(\boldsymbol{Y}_i\,\vert\, \boldsymbol{Y}_{i-1})

to give

\displaystyle  \boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i} = \boldsymbol{y}_i  \sim  {{\cal{N}}\big( \hat{\boldsymbol{x}}^\flat_i + \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top  \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1}  (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) , \hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i\big)}

This is called the measurement update; more explicitly

\displaystyle  \begin{aligned}  \hat{\boldsymbol{x}}^i &\triangleq  \hat{\boldsymbol{x}}^\flat_i +  \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top  \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1}  (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq  {\hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i}  \end{aligned}

Sometimes the measurement residual \boldsymbol{v}_i, the measurement prediction covariance \boldsymbol{S}_i and the filter gain \boldsymbol{K}_i are defined and the measurement update is written as

\displaystyle  \begin{aligned}  \boldsymbol{v}_i & \triangleq  \boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i \\  \boldsymbol{S}_i & \triangleq  \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top\boldsymbol{S}^{-1}_i \\  \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i  \end{aligned}

We further have that

\displaystyle  \begin{aligned}  {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_{i-1}}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_i}  \end{aligned}

We thus obtain the Kalman filter prediction step:

\displaystyle  \begin{aligned}  \hat{\boldsymbol{x}}^\flat_i &=  \boldsymbol{A}_{i-1}\hat{\boldsymbol{x}}_{i-1} \\  \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1}  \hat{\boldsymbol{\Sigma}}_{i-1}  \boldsymbol{A}_{i-1}^\top  + \boldsymbol{\Sigma}^{(x)}_{i-1}  \end{aligned}

Further information can be found in (Boyd 2008), (Kleeman 1996) and (Särkkä 2013).

A Haskell Implementation

The hmatrix now uses type level literals via the DataKind extension in ghc to enforce compatibility of matrix and vector operations at the type level. See here for more details. Sadly a bug in the hmatrix implementation means we can’t currently use this excellent feature and we content ourselves with comments describing what the types would be were it possible to use it.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> module FunWithKalmanPart1a where
> import Numeric.LinearAlgebra.HMatrix hiding ( outer )
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import Control.Monad.Loops

Let us make our model almost deterministic but with noisy observations.

> stateVariance :: Double
> stateVariance = 1e-6
> obsVariance :: Double
> obsVariance = 1.0

And let us start with a prior normal distribution with a mean position and velocity of 0 with moderate variances and no correlation.

> -- muPrior :: R 2
> muPrior :: Vector Double
> muPrior = vector [0.0, 0.0]
> -- sigmaPrior :: Sq 2
> sigmaPrior :: Matrix Double
> sigmaPrior = (2 >< 2) [ 1e1,   0.0
>                       , 0.0,   1e1
>                       ]

We now set up the parameters for our model as outlined in the preceeding section.

> deltaT :: Double
> deltaT = 0.001
> -- bigA :: Sq 2
> bigA :: Matrix Double
> bigA = (2 >< 2) [ 1, deltaT
>                 , 0,      1
>                 ]
> a :: Double
> a = 1.0
> -- bigH :: L 1 2
> bigH :: Matrix Double
> bigH = (1 >< 2) [ a, 0
>                 ]
> -- bigSigmaY :: Sq 1
> bigSigmaY :: Matrix Double
> bigSigmaY = (1 >< 1) [ obsVariance ]
> -- bigSigmaX :: Sq 2
> bigSigmaX :: Matrix Double
> bigSigmaX = (2 >< 2) [ stateVariance, 0.0
>                      , 0.0,           stateVariance
>                      ]

The implementation of the Kalman filter using the hmatrix package is straightforward.

> -- outer ::  forall m n . (KnownNat m, KnownNat n) =>
> --           R n -> Sq n -> L m n -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer :: Vector Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> [Vector Double]
>          -> [(Vector Double, Matrix Double)]
> outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
> 
>     -- update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         -- v :: R m
>         v = y - bigH #> xHatFlat
>         -- bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>         -- bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>         -- xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         -- bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>         -- xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         -- bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

We create some ranodm data using our model parameters.

> singleSample ::(Double, Double) ->
>                RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> singleSample (xPrev, vPrev) = do
>   psiX <- rvarT (Normal 0.0 stateVariance)
>   let xNew = xPrev + deltaT * vPrev + psiX
>   psiV <- rvarT (Normal 0.0 stateVariance)
>   let vNew = vPrev + psiV
>   upsilon <- rvarT (Normal 0.0 obsVariance)
>   let y = a * xNew + upsilon
>   lift $ W.tell [(y, (xNew, vNew))]
>   return (xNew, vNew)
> streamSample :: RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> streamSample = iterateM_ singleSample (1.0, 1.0)
> samples :: ((Double, Double), [(Double, (Double, Double))])
> samples = W.runWriter (evalStateT (sample streamSample) (pureMT 2))

Here are the actual values of the randomly generated positions.

> actualXs :: [Double]
> actualXs = map (fst . snd) $ take nObs $ snd samples
> test :: [(Vector Double, Matrix Double)]
> test = outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX
>        (map (\x -> vector [x]) $ map fst $ snd samples)

And using the Kalman filter we can estimate the positions.

> estXs :: [Double]
> estXs = map (!!0) $ map toList $ map fst $ take nObs test
> nObs :: Int
> nObs = 1000

And we can see that the estimates track the actual positions quite nicely.

Of course we really wanted to estimate the velocity.

> actualVs :: [Double]
> actualVs = map (snd . snd) $ take nObs $ snd samples
> estVs :: [Double]
> estVs = map (!!1) $ map toList $ map fst $ take nObs test

Bibliography

Boyd, Stephen. 2008. “EE363 Linear Dynamical Systems.” http://stanford.edu/class/ee363.

Kleeman, Lindsay. 1996. “Understanding and Applying Kalman Filtering.” In Proceedings of the Second Workshop on Perceptive Systems, Curtin University of Technology, Perth Western Australia (25-26 January 1996).

Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. Vol. 3. Cambridge University Press.

Fun with (Kalman) Filters Part I

Suppose we wish to estimate the mean of a sample drawn from a normal distribution. In the Bayesian approach, we know the prior distribution for the mean (it could be a non-informative prior) and then we update this with our observations to create the posterior, the latter giving us improved information about the distribution of the mean. In symbols

\displaystyle p(\theta \,\vert\, x) \propto p(x \,\vert\, \theta)p(\theta)

Typically, the samples are chosen to be independent, and all of the data is used to perform the update but, given independence, there is no particular reason to do that, updates can performed one at a time and the result is the same; nor is the order of update important. Being a bit imprecise, we have

\displaystyle p(z \,\vert\, x, y) = p(z, x, y)p(x, y) = p(z, x, y)p(x)p(y) = p((z \,\vert\, x) \,\vert\, y) = p((z \,\vert\, y) \,\vert\, x)

The standard notation in Bayesian statistics is to denote the parameters of interest as \theta \in \mathbb{R}^p and the observations as x \in \mathbb{R}^n. For reasons that will become apparent in later blog posts, let us change notation and label the parameters as x and the observations as y.

Let us take a very simple example of a prior X \sim {\cal{N}}(0, \sigma^2) where \sigma^2 is known and then sample from a normal distribution with mean x and variance for the i-th sample c_i^2 where c_i is known (normally we would not know the variance but adding this generality would only clutter the exposition unnecessarily).

\displaystyle p(y_i \,\vert\, x) = \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)

The likelihood is then

\displaystyle p(\boldsymbol{y} \,\vert\, x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)

As we have already noted, instead of using this with the prior to calculate the posterior, we can update the prior with each observation separately. Suppose that we have obtained the posterior given i - 1 samples (we do not know this is normally distributed yet but we soon will):

\displaystyle p(x \,\vert\, y_1,\ldots,y_{i-1}) = {\cal{N}}(\hat{x}_{i-1}, \hat{\sigma}^2_{i-1})

Then we have

\displaystyle \begin{aligned} p(x \,\vert\, y_1,\ldots,y_{i}) &\propto p(y_i \,\vert\, x)p(x \,\vert\, y_1,\ldots,y_{i-1}) \\ &\propto \exp-\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg) \exp-\bigg(\frac{(x - \hat{x}_{i-1})^2}{2\hat{\sigma}_{i-1}^2}\bigg) \\ &\propto \exp-\Bigg(\frac{x^2}{c_i^2} - \frac{2xy_i}{c_i^2} + \frac{x^2}{\hat{\sigma}_{i-1}^2} - \frac{2x\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg) \\ &\propto \exp-\Bigg( x^2\Bigg(\frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}\Bigg) - 2x\Bigg(\frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg)\Bigg) \end{aligned}

Writing

\displaystyle \frac{1}{\hat{\sigma}_{i}^2} \triangleq \frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}

and then completing the square we also obtain

\displaystyle \frac{\hat{x}_{i}}{\hat{\sigma}_{i}^2} \triangleq \frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}

More Formally

Now let’s be a bit more formal about conditional probability and use the notation of \sigma-algebras to define {\cal{F}}_i = \sigma\{Y_1,\ldots, Y_i\} and M_i \triangleq \mathbb{E}(X \,\vert\, {\cal{F}}_i) where Y_i = X + \epsilon_i, X is as before and \epsilon_i \sim {\cal{N}}(0, c_k^2). We have previously calculated that M_i = \hat{x}_i and that {\cal{E}}((X - M_i)^2 \,\vert\, Y_1, \ldots Y_i) = \hat{\sigma}_{i}^2 and the tower law for conditional probabilities then allows us to conclude {\cal{E}}((X - M_i)^2) = \hat{\sigma}_{i}^2. By Jensen’s inequality, we have

\displaystyle {\cal{E}}(M_i^2) = {\cal{E}}({\cal{E}}(X \,\vert\, {\cal{F}}_i)^2)) \leq {\cal{E}}({\cal{E}}(X^2 \,\vert\, {\cal{F}}_i))) = {\cal{E}}(X^2) = \sigma^2

Hence M is bounded in L^2 and therefore converges in L^2 and almost surely to M_\infty \triangleq {\cal{E}}(X \,\vert\, {\cal{F}}_\infty). The noteworthy point is that if M_\infty = X if and only if \hat{\sigma}_i converges to 0. Explicitly we have

\displaystyle \frac{1}{\hat{\sigma}_i^2} = \frac{1}{\sigma^2} + \sum_{k=1}^i\frac{1}{c_k^2}

which explains why we took the observations to have varying and known variances. You can read more in Williams’ book (Williams 1991).

A Quick Check

We have reformulated our estimation problem as a very simple version of the celebrated Kalman filter. Of course, there are much more interesting applications of this but for now let us try “tracking” the sample from the random variable.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> module FunWithKalmanPart1 (
>     obs
>   , nObs
>   , estimates
>   , uppers
>   , lowers
>   ) where
> 
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> var, cSquared :: Double
> var       = 1.0
> cSquared  = 1.0
> 
> nObs :: Int
> nObs = 100
> createObs :: RVar (Double, [Double])
> createObs = do
>   x <- rvar (Normal 0.0 var)
>   ys <- replicateM nObs $ rvar (Normal x cSquared)
>   return (x, ys)
> 
> obs :: (Double, [Double])
> obs = evalState (sample createObs) (pureMT 2)
> 
> updateEstimate :: (Double, Double) -> (Double, Double) -> (Double, Double)
> updateEstimate (xHatPrev, varPrev) (y, cSquared) = (xHatNew, varNew)
>   where
>     varNew  = recip (recip varPrev + recip cSquared)
>     xHatNew = varNew * (y / cSquared + xHatPrev / varPrev)
> 
> estimates :: [(Double, Double)]
> estimates = scanl updateEstimate (y, cSquared) (zip ys (repeat cSquared))
>   where
>     y  = head $ snd obs
>     ys = tail $ snd obs
> 
> uppers :: [Double]
> uppers = map (\(x, y) -> x + 3 * (sqrt y)) estimates
> 
> lowers :: [Double]
> lowers = map (\(x, y) -> x - 3 * (sqrt y)) estimates

Bibliography

Williams, David. 1991. Probability with Martingales. Cambridge University Press.

Hölder’s and Minkowski’s Inequalities

Introduction

I have seen Hölder’s inequality and Minkowski’s inequality proved in several ways but this seems the most perspicuous (to me at any rate).

Young’s Inequality

If a, b \ge 0 and p,q \ge 1 such that

\displaystyle   \frac{1}{p} + \frac{1}{q} = 1

then

\displaystyle   ab \le \frac{a^p}{p} + \frac{b^q}{q}

A p and q satisfying the premise are known as conjugate indices.

Proof

Since \log is convex we have

\displaystyle   t\log{x} + (1 - t)\log{y} \le \log{(tx + (1 - t)y)}

Substituting in appropriate values gives

\displaystyle   \frac{1}{p}\log{a^p} + \frac{1}{q}\log{b^q} \le \log{\bigg(\frac{a^p}{p} + \frac{b^q}{q}\bigg)}

or

\displaystyle   \log{a} + \log{b} \le \log{\bigg(\frac{a^p}{p} + \frac{b^q}{q}\bigg)}

Now take exponents.

\blacksquare

Hölders’s Inequality

Let p and q be conjugate indices with 1 < p < \infty and let f \in L^p(\Omega) and g \in L^q(\Omega) then fg \in L^1(\Omega) and

\displaystyle   \|fg\|_{L^1} \le \|f\|_{L^p}\|g\|_{L^q}

Proof

By Young’s inequality

\displaystyle   \int_\Omega \frac{|f(x)|}{\|f\|_{L^p}} \frac{|g(x)|}{\|g\|_{L^q}} \le  \int_\Omega \frac{1}{p}\frac{|f(x)|^p}{\|f\|_{L^p}^p} +              \frac{1}{q}\frac{|g(x)|^q}{\|g\|_{L^q}^q} =  \frac{1}{p} + \frac{1}{q} = 1

\blacksquare

By applying a counting measure to \Omega we also obtain

\displaystyle   \sum |x_i y_i| \le \big(\sum |x_i|^p\big)^{1/p} \big(\sum |y_i|^q\big)^{1/q}

Minkowski’s Inequality

\displaystyle   \|f + g\|_{L^p} \le \|f\|_{L^p} + \|g\|_{L^p}

Proof

By Hölder’s inequality

\displaystyle   \int_\Omega |f + g|^p \le \int_\Omega |f||f + g|^{p-1} +                            \int_\Omega |g||f + g|^{p-1}                            \le \|f\|_{L^p}A + \|g\|_{L^p}A

where

\displaystyle   A = \||f + g|^{p-1}\|_{L^q} = \big(\int_\Omega |f(x) + g(x)|^p\big)^{1/q}

and A is finite since L^p is a vector space.

\blacksquare