In their paper, Betancourt et al. (2014) give a corollary which starts with the phrase “Because the manifold is paracompact”. It wasn’t immediately clear to me why the manifold was paracompact, or indeed what paracompactness meant, although it was clearly something like compactness, which means that every open cover has a finite subcover.
It turns out that every manifold is paracompact and that this is intimately related to partitions of unity.
Most of what I have written below is taken from some hand-written anonymous lecture notes I found by chance in the DPMMS library at Cambridge University. To whoever wrote them: thank you very much.
Let be an open cover of a smooth manifold . A partition of unity on M, subordinate to the cover is a finite collection of smooth functions
where for some such that
and for each there exists such that
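Written out in symbols, one common way to state this definition is the following. The names $\{U_\alpha\}_{\alpha \in \mathcal{A}}$ for the cover, $X_j$ for the functions, $N$ for their number and $\alpha(j)$ for the index of the covering set are assumed notation, chosen for illustration.

```latex
\begin{aligned}
&X_j : M \longrightarrow [0, \infty), \qquad j = 1, \ldots, N, \\
&\sum_{j=1}^{N} X_j(x) = 1 \quad \text{for all } x \in M, \\
&\operatorname{supp} X_j \subset U_{\alpha(j)} \quad \text{for some } \alpha(j) \in \mathcal{A}.
\end{aligned}
```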
We don’t yet know partitions of unity exist.
First define
Techniques of classical analysis easily show that is smooth ( is the only point that might be in doubt and it can be checked from first principles that for all ).
Next define
Finally we can define by . This has the properties
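To make the construction concrete, here is a minimal Haskell sketch of one standard choice of these three functions. The specific formulas (`f t = exp (-1/t)` for positive `t`, the quotient `g`, and `h t = g (t + 2) * g (2 - t)`) and the radii 1 and 2 are my assumptions about what is intended; any smooth function with the same plateau and support would serve equally well.

```haskell
-- | f t = exp (-1/t) for t > 0 and 0 otherwise.
-- All derivatives vanish at 0, so f is smooth on the whole real line.
f :: Double -> Double
f t
  | t > 0     = exp (negate (1 / t))
  | otherwise = 0

-- | A smooth step: 0 for t <= 0, 1 for t >= 1, strictly in between otherwise.
-- The denominator never vanishes because either t > 0 or 1 - t > 0.
g :: Double -> Double
g t = f t / (f t + f (1 - t))

-- | A smooth bump: equal to 1 on [-1, 1] and to 0 outside (-2, 2).
h :: Double -> Double
h t = g (t + 2) * g (2 - t)
```

Loading this in GHCi and evaluating `map h [-3, -1, 0, 1.5, 3]` is a quick way to see the plateau and the decay to zero.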
Now take a point centred in a chart so that, without loss of generality, (we can always choose so that the open ball and then define another chart with ).
Define the images of the open and closed balls of radius and respectively
and further define bump functions
Then is smooth and its support lies in .
By compactness, the open cover has a finite subcover . Now define
by
Then is smooth, and . Thus is the required partition of unity.
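In symbols, the normalisation step is the usual one. Here $\psi_1, \ldots, \psi_N$ stand for the bump functions attached to the finite subcover and $X_j$ for the resulting partition of unity; both names are assumed notation.

```latex
X_j(x) = \frac{\psi_j(x)}{\sum_{k=1}^{N} \psi_k(x)},
\qquad
\sum_{j=1}^{N} X_j(x) = 1,
\qquad
\operatorname{supp} X_j = \operatorname{supp} \psi_j .
```

The denominator is strictly positive because every point of the manifold lies in one of the sets of the finite subcover, on which the corresponding bump function is positive.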
Because is a manifold, it has a countable basis and for any point , there must exist with . Choose one of these and call it . This gives a countable cover of by such sets.
Now define
where, since is compact, is a finite subcover.
And further define
where again, since is compact, is a finite subcover.
Now define
Then is compact, is open and . Furthermore, and only intersects with and .
Given any open cover of , each can be covered by a finite number of open sets in contained in some member of . Thus every point in can be covered by at most a finite number of sets from and and which are contained in some member of . This is a locally finite refinement of and which is precisely the definition of paracompactness.
To produce a partition of unity we define bump functions as above on this locally finite cover and note that locally finite implies that is well defined. Again, as above, define
to get the required result.
Betancourt, M. J., Simon Byrne, Samuel Livingstone, and Mark Girolami. 2014. “The Geometric Foundations of Hamiltonian Monte Carlo,” October, 45. http://arxiv.org/abs/1410.5110.
In Proposition 58 of Chapter 1 of the excellent book O’Neill (1983), the author demonstrates that the Lie derivative of one vector field with respect to another is the same as the Lie bracket of the two vector fields, although he calls the Lie bracket just the bracket and does not define the Lie derivative, preferring to use its definition without giving it a name. The proof relies on a prior result where he shows a co-ordinate system at a point can be given to a vector field for which so that .
Here’s a proof that seems clearer (to me at any rate) and avoids having to distinguish the cases where the vector field is zero or non-zero. These notes give a similar proof but, strangely for undergraduate level, elide some of the details.
Let be a smooth mapping and let be a tensor with then define the pullback of by to be
For a tensor the pullback is defined to be .
Standard manipulations show that is a smooth (covariant) tensor field and that is -linear and that .
Let be a diffeomorphism and a vector field on we define the pullback of this field to be
Note that the pullback of a vector field only exists in the case where is a diffeomorphism; in contradistinction, in the case of pullbacks of purely covariant tensors, the pullback always exists.
For the proof below, we only need the pullback of functions and vector fields; the pullback for tensors with is purely to give a bit of context.
From O’Neill (1983) Chapter 1 Definition 20, let be a smooth mapping. Vector fields on and on are –related written if and only if .
By Lemma 21 Chapter 1 of O’Neill (1983), and are -related if and only if .
Recalling that and since
we see that the fields and are -related: . Thus we can apply the Lemma.
Although we don’t need this, we can express the immediately above equivalence in a way similar to the rule for covariant tensors
First let’s calculate the Lie derivative of a function with respect to a vector field where is its flow
Analogously defining the Lie derivative of with respect to
we have
Since we have
Thus
as required.
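The whole argument can be summarised in a few lines. This is a sketch in assumed notation: $\phi_t$ is the flow of $X$, the pullback of a function is $\phi_t^* f = f \circ \phi_t$, and the pullback of a vector field is $\phi_t^* Y = d\phi_t^{-1}(Y \circ \phi_t)$.

```latex
\begin{aligned}
L_X f &= \frac{d}{dt}\bigg|_{t=0} \phi_t^* f = X f,
\qquad
L_X Y = \frac{d}{dt}\bigg|_{t=0} \phi_t^* Y, \\
(\phi_t^* Y)(\phi_t^* f) &= \phi_t^* (Y f)
  \quad \text{(because } \phi_t^* Y \text{ and } Y \text{ are } \phi_t\text{-related)}, \\
(L_X Y) f + Y (L_X f) &= L_X (Y f)
  \quad \text{(differentiating at } t = 0 \text{, where } \phi_0 = \mathrm{id}\text{)}, \\
(L_X Y) f &= X (Y f) - Y (X f) = [X, Y] f .
\end{aligned}
```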
O’Neill, B. 1983. Semi-Riemannian Geometry with Applications to Relativity, 103. Pure and Applied Mathematics. Elsevier Science. https://books.google.com.au/books?id=CGk1eRSjFIIC.
The equation of motion for a pendulum of unit length subject to Gaussian white noise is
We can discretize this via the usual Euler method
where and
The explanation of the precise form of the covariance matrix will be the subject of another blog post; for the purpose of exposition of forward filtering / backward smoothing, this detail is not important.
Assume that we can only measure the horizontal position of the pendulum and further that this measurement is subject to error so that
where .
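For reference, the model can be written out as follows. The discretised form is read off from the Haskell functions `pendulumSample` and `bigQl` in this post; the symbols ($\alpha$ for the angle, $x_1 = \alpha$, $x_2 = \dot\alpha$, $\eta$, $\epsilon$, $q^c$) are assumed notation, and the continuous-time equation is the usual noisy pendulum of unit length.

```latex
\begin{aligned}
\frac{d^2\alpha}{dt^2} &= -g \sin\alpha + w(t), \\
\begin{bmatrix} x_{1,i} \\ x_{2,i} \end{bmatrix}
&=
\begin{bmatrix}
  x_{1,i-1} + x_{2,i-1}\,\Delta t \\
  x_{2,i-1} - g \sin(x_{1,i-1})\,\Delta t
\end{bmatrix}
+ \eta_i,
\qquad \eta_i \sim \mathcal{N}(0, Q), \\
Q &= q^c
\begin{bmatrix}
  \Delta t^3 / 3 & \Delta t^2 / 2 \\
  \Delta t^2 / 2 & \Delta t
\end{bmatrix}, \\
y_i &= \sin(x_{1,i}) + \epsilon_i,
\qquad \epsilon_i \sim \mathcal{N}(0, R).
\end{aligned}
```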
Particle filtering can give us an estimate of where the pendulum is, and of its velocity, using all the observations up to that point in time. But now suppose we have observed the pendulum for a fixed period of time. Then at times earlier than the time at which we stop observing, we have observations in the future as well as in the past. If we can somehow take account of these future observations then we should be able to improve our estimate of where the pendulum was at any given point in time (as well as its velocity). Forward Filtering / Backward Smoothing is a technique for doing this.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE MultiParamTypeClasses #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE ExplicitForAll #-}
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE FlexibleInstances #-}
> {-# LANGUAGE FlexibleContexts #-}
> {-# LANGUAGE BangPatterns #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
> {-# LANGUAGE TemplateHaskell #-}
> {-# LANGUAGE DeriveGeneric #-}
> module PendulumSamples ( pendulumSamples
> , pendulumSamples'
> , testFiltering
> , testSmoothing
> , testFilteringG
> , testSmoothingG
> ) where
> import Data.Random hiding ( StdNormal, Normal )
> import Data.Random.Source.PureMT ( pureMT )
> import Control.Monad.State ( evalState, replicateM )
> import qualified Control.Monad.Loops as ML
> import Control.Monad.Writer ( tell, WriterT, lift,
> runWriterT
> )
> import Numeric.LinearAlgebra.Static
> ( R, vector, Sym,
> headTail, matrix, sym,
> diag
> )
> import GHC.TypeLits ( KnownNat )
> import MultivariateNormal ( MultivariateNormal(..) )
> import qualified Data.Vector as V
> import Data.Bits ( shiftR )
> import Data.List ( transpose )
> import Control.Parallel.Strategies
> import GHC.Generics (Generic)
Let’s first plot some typical trajectories of the pendulum.
> deltaT, g :: Double
> deltaT = 0.01
> g = 9.81
> type PendulumState = R 2
> type PendulumObs = R 1
> pendulumSample :: MonadRandom m =>
>                   Sym 2 ->
>                   Sym 1 ->
>                   PendulumState ->
>                   m (Maybe ((PendulumState, PendulumObs), PendulumState))
> pendulumSample bigQ bigR xPrev = do
>   let x1Prev = fst $ headTail xPrev
>       x2Prev = fst $ headTail $ snd $ headTail xPrev
>   eta <- sample $ rvar (MultivariateNormal 0.0 bigQ)
>   let x1 = x1Prev + x2Prev * deltaT
>       x2 = x2Prev - g * (sin x1Prev) * deltaT
>       xNew = vector [x1, x2] + eta
>       x1New = fst $ headTail xNew
>   epsilon <- sample $ rvar (MultivariateNormal 0.0 bigR)
>   let yNew = vector [sin x1New] + epsilon
>   return $ Just ((xNew, yNew), xNew)
Let’s try plotting some samples when we are in the linear region with which we are familiar from school, $\sin\alpha \approx \alpha$.
In this case we expect the horizontal displacement to be approximately equal to the angle of displacement and thus the observations to be symmetric about the actuals.
> bigQ :: Sym 2
> bigQ = sym $ matrix bigQl
> qc1 :: Double
> qc1 = 0.0001
> bigQl :: [Double]
> bigQl = [ qc1 * deltaT^3 / 3, qc1 * deltaT^2 / 2,
> qc1 * deltaT^2 / 2, qc1 * deltaT
> ]
> bigR :: Sym 1
> bigR = sym $ matrix [0.0001]
> m0 :: PendulumState
> m0 = vector [0.01, 0]
> pendulumSamples :: [(PendulumState, PendulumObs)]
> pendulumSamples = evalState (ML.unfoldrM (pendulumSample bigQ bigR) m0) (pureMT 17)
But if we work in a region in which linearity breaks down then the observations are no longer symmetrical about the actuals.
> bigQ' :: Sym 2
> bigQ' = sym $ matrix bigQl'
> qc1' :: Double
> qc1' = 0.01
> bigQl' :: [Double]
> bigQl' = [ qc1' * deltaT^3 / 3, qc1' * deltaT^2 / 2,
> qc1' * deltaT^2 / 2, qc1' * deltaT
> ]
> bigR' :: Sym 1
> bigR' = sym $ matrix [0.1]
> m0' :: PendulumState
> m0' = vector [1.6, 0]
> pendulumSamples' :: [(PendulumState, PendulumObs)]
> pendulumSamples' = evalState (ML.unfoldrM (pendulumSample bigQ' bigR') m0') (pureMT 17)
We do not give the theory behind particle filtering. The interested reader can either consult Särkkä (2013) or wait for a future blog post on the subject.
> nParticles :: Int
> nParticles = 500
The usual Bayesian update step.
> type Particles a = V.Vector a
> oneFilteringStep ::
>   MonadRandom m =>
>   (Particles a -> m (Particles a)) ->
>   (Particles a -> Particles b) ->
>   (b -> b -> Double) ->
>   Particles a ->
>   b ->
>   WriterT [Particles a] m (Particles a)
> oneFilteringStep stateUpdate obsUpdate weight statePrevs obs = do
>   -- Record the particles we start this step with.
>   tell [statePrevs]
>   -- Propagate every particle through the (noisy) state update.
>   stateNews <- lift $ stateUpdate statePrevs
>   let obsNews = obsUpdate stateNews
>   -- Weight each particle by how well it explains the observation.
>   let weights       = V.map (weight obs) obsNews
>       cumSumWeights = V.tail $ V.scanl (+) 0 weights
>       totWeight     = V.last cumSumWeights
>   -- Resample with replacement in proportion to the weights.
>   vs <- lift $ V.replicateM nParticles (sample $ uniform 0.0 totWeight)
>   let js          = indices cumSumWeights vs
>       stateTildes = V.map (stateNews V.!) js
>   return stateTildes
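The resampling at the end of this step is easier to see in isolation. The following is a minimal sketch using plain lists and only `base`, with the uniform draws passed in explicitly so that the function is pure; the names `cumulative`, `pick` and `resampleIndices` are hypothetical and are not part of the module above.

```haskell
import Data.List (findIndex)
import Data.Maybe (fromMaybe)

-- | Running totals of the weights: [w1, w1 + w2, ...].
cumulative :: [Double] -> [Double]
cumulative = drop 1 . scanl (+) 0

-- | Index of the first cumulative weight that is >= v, clamped to the
-- last index. This inverts the (unnormalised) empirical CDF at v.
pick :: [Double] -> Double -> Int
pick cs v = fromMaybe (length cs - 1) (findIndex (>= v) cs)

-- | One particle index per draw. Each draw is assumed to be uniform
-- on [0, sum of the weights].
resampleIndices :: [Double] -> [Double] -> [Int]
resampleIndices ws = map (pick (cumulative ws))
```

With weights `[1, 3]` the second particle is three times as likely to be chosen as the first: draws in `[0, 1]` select index 0 and draws in `(1, 4]` select index 1. The linear scan here is for clarity; the binary search used in the post does the same job in logarithmic time.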
The system state and observable.
> data SystemState a = SystemState { x1 :: a, x2 :: a }
> deriving (Show, Generic)
> instance NFData a => NFData (SystemState a)
> newtype SystemObs a = SystemObs { y1 :: a }
> deriving Show
To make the system state update a bit more readable, let’s introduce some lifted arithmetic operators.
> (.+), (.*), (.-) :: (Num a) => V.Vector a -> V.Vector a -> V.Vector a
> (.+) = V.zipWith (+)
> (.*) = V.zipWith (*)
> (.-) = V.zipWith (-)
The state update itself
> stateUpdate :: Particles (SystemState Double) ->
> Particles (SystemState Double)
> stateUpdate xPrevs = V.zipWith SystemState x1s x2s
> where
> ix = V.length xPrevs
>
> x1Prevs = V.map x1 xPrevs
> x2Prevs = V.map x2 xPrevs
>
> deltaTs = V.replicate ix deltaT
> gs = V.replicate ix g
> x1s = x1Prevs .+ (x2Prevs .* deltaTs)
> x2s = x2Prevs .- (gs .* (V.map sin x1Prevs) .* deltaTs)
and its noisy version.
> stateUpdateNoisy :: MonadRandom m =>
> Sym 2 ->
> Particles (SystemState Double) ->
> m (Particles (SystemState Double))
> stateUpdateNoisy bigQ xPrevs = do
> let xs = stateUpdate xPrevs
>
> x1s = V.map x1 xs
> x2s = V.map x2 xs
>
> let ix = V.length xPrevs
> etas <- replicateM ix $ sample $ rvar (MultivariateNormal 0.0 bigQ)
>
> let eta1s, eta2s :: V.Vector Double
> eta1s = V.fromList $ map (fst . headTail) etas
> eta2s = V.fromList $ map (fst . headTail . snd . headTail) etas
>
> return (V.zipWith SystemState (x1s .+ eta1s) (x2s .+ eta2s))
The function which maps the state to the observable.
> obsUpdate :: Particles (SystemState Double) ->
> Particles (SystemObs Double)
> obsUpdate xs = V.map (SystemObs . sin . x1) xs
And finally a function to calculate the weight of each particle given an observation.
> weight :: forall a n . KnownNat n =>
> (a -> R n) ->
> Sym n ->
> a -> a -> Double
> weight f bigR obs obsNew = pdf (MultivariateNormal (f obsNew) bigR) (f obs)
The variance of the prior.
> bigP :: Sym 2
> bigP = sym $ diag 0.1
Generate our ensemble of particles chosen from the prior,
> initParticles :: MonadRandom m =>
> m (Particles (SystemState Double))
> initParticles = V.replicateM nParticles $ do
> r <- sample $ rvar (MultivariateNormal m0' bigP)
> let x1 = fst $ headTail r
> x2 = fst $ headTail $ snd $ headTail r
> return $ SystemState { x1 = x1, x2 = x2}
run the particle filter,
> runFilter :: Int -> [Particles (SystemState Double)]
> runFilter nTimeSteps = snd $ evalState action (pureMT 19)
> where
> action = runWriterT $ do
> xs <- lift $ initParticles
> V.foldM
> (oneFilteringStep (stateUpdateNoisy bigQ') obsUpdate (weight f bigR'))
> xs
> (V.fromList $ map (SystemObs . fst . headTail . snd)
> (take nTimeSteps pendulumSamples'))
and extract the estimated position from the filter.
> testFiltering :: Int -> [Double]
> testFiltering nTimeSteps = map ((/ (fromIntegral nParticles)). sum . V.map x1)
> (runFilter nTimeSteps)
If we could calculate the marginal smoothing distributions then we might be able to sample from them. Using the Markov assumption of our model that is independent of given , we have
We observe that this is a (continuous state space) Markov process with a non-homogeneous transition function albeit one which goes backwards in time. Apparently for conditionally Gaussian linear state-space models, this is known as RTS, or Rauch-Tung-Striebel smoothing (Rauch, Striebel, and Tung (1965)).
According to Cappé (2008),
It appears to be efficient and stable in the long term (although no proof was available at the time the slides were presented).
It is not sequential (in particular, one needs to store all particle positions and weights).
It has numerical complexity proportional where is the number of particles.
We can use this to sample paths starting at time and working backwards. From above we have
where is some normalisation constant (Z for “Zustandssumme” – sum over states).
From particle filtering we know that
Thus
and we can sample from with probability
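In symbols, the chain of reasoning above is the standard backward-simulation argument. The notation is assumed: $x_i^{(j)}$ and $w_i^{(j)}$ are the filtering particles and weights at time $i$, $\tilde{x}_{i+1}$ is the smoothing sample already drawn at time $i+1$, and $N$ is the final time.

```latex
\begin{aligned}
p(x_i \mid x_{i+1}, y_{1:N})
  &= p(x_i \mid x_{i+1}, y_{1:i})
   = \frac{p(x_{i+1} \mid x_i)\, p(x_i \mid y_{1:i})}{p(x_{i+1} \mid y_{1:i})}
   = \frac{1}{Z}\, p(x_{i+1} \mid x_i)\, p(x_i \mid y_{1:i}), \\
\hat{p}(x_i \mid y_{1:i})
  &= \sum_{j} w_i^{(j)}\, \delta\big(x_i - x_i^{(j)}\big), \\
\hat{p}(x_i \mid \tilde{x}_{i+1}, y_{1:N})
  &= \frac{1}{Z} \sum_{j} w_i^{(j)}\, p\big(\tilde{x}_{i+1} \mid x_i^{(j)}\big)\,
     \delta\big(x_i - x_i^{(j)}\big), \\
\Pr\big(\tilde{x}_i = x_i^{(j)}\big)
  &= \frac{w_i^{(j)}\, p\big(\tilde{x}_{i+1} \mid x_i^{(j)}\big)}
          {\sum_{k} w_i^{(k)}\, p\big(\tilde{x}_{i+1} \mid x_i^{(k)}\big)} .
\end{aligned}
```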
Recalling that we have re-sampled the particles in the filtering algorithm so that their weights are all and abstracting the state update and state density function, we can encode this update step in Haskell as
> oneSmoothingStep :: MonadRandom m =>
>                     (Particles a -> V.Vector a) ->
>                     (a -> a -> Double) ->
>                     a ->
>                     Particles a ->
>                     WriterT (Particles a) m a
> oneSmoothingStep stateUpdate
>                  stateDensity
>                  smoothingSampleAtiPlus1
>                  filterSamplesAti = do it
>   where
>     it = do
>       -- Predicted means of the filtering particles one step ahead.
>       let mus = stateUpdate filterSamplesAti
>           -- Transition density of the already sampled later state
>           -- under each prediction.
>           weights = V.map (stateDensity smoothingSampleAtiPlus1) mus
>           cumSumWeights = V.tail $ V.scanl (+) 0 weights
>           totWeight = V.last cumSumWeights
>       v <- lift $ sample $ uniform 0.0 totWeight
>       let ix = binarySearch cumSumWeights v
>           xnNew = filterSamplesAti V.! ix
>       tell $ V.singleton xnNew
>       return $ xnNew
To sample a complete path we start with a sample from the filtering distribution at time (which is also the smoothing distribution)
> oneSmoothingPath :: MonadRandom m =>
> (Int -> V.Vector (Particles a)) ->
> (a -> Particles a -> WriterT (Particles a) m a) ->
> Int -> m (a, V.Vector a)
> oneSmoothingPath filterEstss oneSmoothingStep nTimeSteps = do
> let ys = filterEstss nTimeSteps
> ix <- sample $ uniform 0 (nParticles - 1)
> let xn = (V.head ys) V.! ix
> runWriterT $ V.foldM oneSmoothingStep xn (V.tail ys)
> oneSmoothingPath' :: (MonadRandom m, Show a) =>
> (Int -> V.Vector (Particles a)) ->
> (a -> Particles a -> WriterT (Particles a) m a) ->
> Int -> WriterT (Particles a) m a
> oneSmoothingPath' filterEstss oneSmoothingStep nTimeSteps = do
> let ys = filterEstss nTimeSteps
> ix <- lift $ sample $ uniform 0 (nParticles - 1)
> let xn = (V.head ys) V.! ix
> V.foldM oneSmoothingStep xn (V.tail ys)
Of course we need to run through the filtering distributions starting at
> filterEstss :: Int -> V.Vector (Particles (SystemState Double))
> filterEstss n = V.reverse $ V.fromList $ runFilter n
> testSmoothing :: Int -> Int -> [Double]
> testSmoothing m n = V.toList $ evalState action (pureMT 23)
> where
> action = do
> xss <- V.replicateM n $
> snd <$> (oneSmoothingPath filterEstss (oneSmoothingStep stateUpdate (weight h bigQ')) m)
> let yss = V.fromList $ map V.fromList $
> transpose $
> V.toList $ V.map (V.toList) $
> xss
> return $ V.map (/ (fromIntegral n)) $ V.map V.sum $ V.map (V.map x1) yss
By eye we can see we get a better fit
and calculating the mean square error for filtering gives against the mean square error for smoothing of ; this confirms the fit really is better as one would hope.
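The comparison can be made with a small helper along the following lines. This is a sketch only: `meanSquareError` is a hypothetical name, not a function from the module above, and it pairs estimates with actual values by position, so aligning the time indices of the filter or smoother output with the true trajectory is left to the caller.

```haskell
-- | Mean square error between estimates and the corresponding true values.
-- The two lists are paired by position and any unmatched tail is ignored.
-- The result is NaN if either list is empty.
meanSquareError :: [Double] -> [Double] -> Double
meanSquareError estimates actuals =
  sum squaredErrors / fromIntegral (length squaredErrors)
  where
    squaredErrors = zipWith (\e a -> (e - a) ^ (2 :: Int)) estimates actuals
```

One would apply it once to the output of `testFiltering` and once to the output of `testSmoothing`, in each case against the first components of the states in `pendulumSamples'`.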
Let us continue with the same example but now assume that $g$ is unknown and that we wish to estimate it. Let us also assume that our apparatus is not subject to noise.
Again we have
But now when we discretize it we include a third variable
where
Again we assume that we can only measure the horizontal position of the pendulum so that
where .
> type PendulumStateG = R 3
> pendulumSampleG :: MonadRandom m =>
> Sym 3 ->
> Sym 1 ->
> PendulumStateG ->
> m (Maybe ((PendulumStateG, PendulumObs), PendulumStateG))
> pendulumSampleG bigQ bigR xPrev = do
> let x1Prev = fst $ headTail xPrev
> x2Prev = fst $ headTail $ snd $ headTail xPrev
> x3Prev = fst $ headTail $ snd $ headTail $ snd $ headTail xPrev
> eta <- sample $ rvar (MultivariateNormal 0.0 bigQ)
> let x1= x1Prev + x2Prev * deltaT
> x2 = x2Prev - g * (sin x1Prev) * deltaT
> x3 = x3Prev
> xNew = vector [x1, x2, x3] + eta
> x1New = fst $ headTail xNew
> epsilon <- sample $ rvar (MultivariateNormal 0.0 bigR)
> let yNew = vector [sin x1New] + epsilon
> return $ Just ((xNew, yNew), xNew)
> pendulumSampleGs :: [(PendulumStateG, PendulumObs)]
> pendulumSampleGs = evalState (ML.unfoldrM (pendulumSampleG bigQg bigRg) mG) (pureMT 29)
> data SystemStateG a = SystemStateG { gx1 :: a, gx2 :: a, gx3 :: a }
> deriving Show
The state update itself
> stateUpdateG :: Particles (SystemStateG Double) ->
> Particles (SystemStateG Double)
> stateUpdateG xPrevs = V.zipWith3 SystemStateG x1s x2s x3s
> where
> ix = V.length xPrevs
>
> x1Prevs = V.map gx1 xPrevs
> x2Prevs = V.map gx2 xPrevs
> x3Prevs = V.map gx3 xPrevs
>
> deltaTs = V.replicate ix deltaT
> x1s = x1Prevs .+ (x2Prevs .* deltaTs)
> x2s = x2Prevs .- (x3Prevs .* (V.map sin x1Prevs) .* deltaTs)
> x3s = x3Prevs
and its noisy version.
> stateUpdateNoisyG :: MonadRandom m =>
> Sym 3 ->
> Particles (SystemStateG Double) ->
> m (Particles (SystemStateG Double))
> stateUpdateNoisyG bigQ xPrevs = do
> let ix = V.length xPrevs
>
> let xs = stateUpdateG xPrevs
>
> x1s = V.map gx1 xs
> x2s = V.map gx2 xs
> x3s = V.map gx3 xs
>
> etas <- replicateM ix $ sample $ rvar (MultivariateNormal 0.0 bigQ)
> let eta1s, eta2s, eta3s :: V.Vector Double
> eta1s = V.fromList $ map (fst . headTail) etas
> eta2s = V.fromList $ map (fst . headTail . snd . headTail) etas
> eta3s = V.fromList $ map (fst . headTail . snd . headTail . snd . headTail) etas
>
> return (V.zipWith3 SystemStateG (x1s .+ eta1s) (x2s .+ eta2s) (x3s .+ eta3s))
The function which maps the state to the observable.
> obsUpdateG :: Particles (SystemStateG Double) ->
> Particles (SystemObs Double)
> obsUpdateG xs = V.map (SystemObs . sin . gx1) xs
The mean and variance of the prior.
> mG :: R 3
> mG = vector [1.6, 0.0, 8.00]
> bigPg :: Sym 3
> bigPg = sym $ matrix [
> 1e-6, 0.0, 0.0
> , 0.0, 1e-6, 0.0
> , 0.0, 0.0, 1e-2
> ]
Parameters for the state update; note that the variance is not exactly the same as in the formulation above.
> bigQg :: Sym 3
> bigQg = sym $ matrix bigQgl
> qc1G :: Double
> qc1G = 0.0001
> sigmaG :: Double
> sigmaG = 1.0e-2
> bigQgl :: [Double]
> bigQgl = [ qc1G * deltaT^3 / 3, qc1G * deltaT^2 / 2, 0.0,
> qc1G * deltaT^2 / 2, qc1G * deltaT, 0.0,
> 0.0, 0.0, sigmaG
> ]
The noise of the measurement.
> bigRg :: Sym 1
> bigRg = sym $ matrix [0.1]
Generate the ensemble of particles from the prior,
> initParticlesG :: MonadRandom m =>
> m (Particles (SystemStateG Double))
> initParticlesG = V.replicateM nParticles $ do
> r <- sample $ rvar (MultivariateNormal mG bigPg)
> let x1 = fst $ headTail r
> x2 = fst $ headTail $ snd $ headTail r
> x3 = fst $ headTail $ snd $ headTail $ snd $ headTail r
> return $ SystemStateG { gx1 = x1, gx2 = x2, gx3 = x3}
run the particle filter,
> runFilterG :: Int -> [Particles (SystemStateG Double)]
> runFilterG n = snd $ evalState action (pureMT 19)
> where
> action = runWriterT $ do
> xs <- lift $ initParticlesG
> V.foldM
> (oneFilteringStep (stateUpdateNoisyG bigQg) obsUpdateG (weight f bigRg))
> xs
> (V.fromList $ map (SystemObs . fst . headTail . snd) (take n pendulumSampleGs))
and extract the estimated parameter from the filter.
> testFilteringG :: Int -> [Double]
> testFilteringG n = map ((/ (fromIntegral nParticles)). sum . V.map gx3) (runFilterG n)
Again we need to run through the filtering distributions starting at
> filterGEstss :: Int -> V.Vector (Particles (SystemStateG Double))
> filterGEstss n = V.reverse $ V.fromList $ runFilterG n
> testSmoothingG :: Int -> Int -> ([Double], [Double], [Double])
> testSmoothingG m n = (\(x, y, z) -> (V.toList x, V.toList y, V.toList z)) $
> mkMeans $
> chunks
> where
>
> chunks =
> V.fromList $ map V.fromList $
> transpose $
> V.toList $ V.map (V.toList) $
> chunksOf m $
> snd $ evalState (runWriterT action) (pureMT 23)
>
> mkMeans yss = (
> V.map (/ (fromIntegral n)) $ V.map V.sum $ V.map (V.map gx1) yss,
> V.map (/ (fromIntegral n)) $ V.map V.sum $ V.map (V.map gx2) yss,
> V.map (/ (fromIntegral n)) $ V.map V.sum $ V.map (V.map gx3) yss
> )
>
> action =
> V.replicateM n $
> oneSmoothingPath' filterGEstss
> (oneSmoothingStep stateUpdateG (weight hG bigQg))
> m
Again by eye we get a better fit but note that we are using the samples in which the state update is noisy as well as the observation so we don’t expect to get a very good fit.
> f :: SystemObs Double -> R 1
> f = vector . pure . y1
> h :: SystemState Double -> R 2
> h u = vector [x1 u , x2 u]
> hG :: SystemStateG Double -> R 3
> hG u = vector [gx1 u , gx2 u, gx3 u]
The following are helpers for sampling via the inverse CDF; the explanation is deferred to another blog post.
> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs
> binarySearch :: (Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
> chunksOf :: Int -> V.Vector a -> V.Vector (V.Vector a)
> chunksOf n xs = ys
> where
> l = V.length xs
> m = 1 + (l - 1) `div` n
> ys = V.unfoldrN m (\us -> Just (V.take n us, V.drop n us)) xs
Cappé, Olivier. 2008. “An Introduction to Sequential Monte Carlo for Filtering and Smoothing.” http://www-irma.u-strasbg.fr/~guillou/meeting/cappe.pdf.
Rauch, H. E., C. T. Striebel, and F. Tung. 1965. “Maximum Likelihood Estimates of Linear Dynamic Systems.” Journal of the American Institute of Aeronautics and Astronautics 3 (8): 1445–50.
Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press.
The equation of motion for a pendulum of unit length subject to Gaussian white noise is
We can discretize this via the usual Euler method
where and
The explanation of the precise form of the covariance matrix will be the subject of another blog post; for the purpose of exposition of using Stan and, in particular, Stan’s ability to handle ODEs, this detail is not important.
Instead of assuming that we know $g$, let us take it to be unknown and suppose that we wish to infer its value using the pendulum as our experimental apparatus.
Stan is a probabilistic programming language which should be well suited to performing such an inference. We use its interface via the R package rstan.
Let’s generate some samples using Stan but rather than generating exactly the model we have given above, instead let’s solve the differential equation and then add some noise. Of course this won’t quite give us samples from the model the parameters of which we wish to estimate but it will allow us to use Stan’s ODE solving capability.
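To see what solving the differential equation amounts to, here is a small Haskell sketch that integrates the noise-free pendulum with the classical fixed-step fourth-order Runge-Kutta method. This is my substitution for illustration only and is not the solver that Stan uses; the names `pendulum`, `rk4Step` and `trajectory` are hypothetical.

```haskell
-- | The state is (angle, angular velocity).
type State = (Double, Double)

-- | Right-hand side of the noise-free pendulum ODE for a given gravity g.
pendulum :: Double -> State -> State
pendulum g (alpha, omega) = (omega, negate g * sin alpha)

-- | One classical fourth-order Runge-Kutta step of size dt.
rk4Step :: (State -> State) -> Double -> State -> State
rk4Step f dt y =
  y `add` scale (dt / 6) (k1 `add` scale 2 k2 `add` scale 2 k3 `add` k4)
  where
    k1 = f y
    k2 = f (y `add` scale (dt / 2) k1)
    k3 = f (y `add` scale (dt / 2) k2)
    k4 = f (y `add` scale dt k3)
    add (a, b) (c, d) = (a + c, b + d)
    scale s (a, b) = (s * a, s * b)

-- | The first n states after the initial one, at times dt, 2 * dt, ...
trajectory :: Double -> Double -> Int -> State -> [State]
trajectory g dt n y0 = take n (drop 1 (iterate (rk4Step (pendulum g) dt) y0))
```

For example, `trajectory 9.81 0.01 100 (1.6, 0)` gives states at the same 100 time points and from the same initial condition as the R code in this post passes to Stan.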
Here’s the Stan
functions {
  real[] pendulum(real t,
                  real[] y,
                  real[] theta,
                  real[] x_r,
                  int[] x_i) {
    real dydt[2];
    dydt[1] <- y[2];
    dydt[2] <- - theta[1] * sin(y[1]);
    return dydt;
  }
}
data {
  int<lower=1> T;
  real y0[2];
  real t0;
  real ts[T];
  real theta[1];
  real sigma[2];
}
transformed data {
  real x_r[0];
  int x_i[0];
}
model {
}
generated quantities {
  real y_hat[T,2];
  y_hat <- integrate_ode(pendulum, y0, t0, ts, theta, x_r, x_i);
  for (t in 1:T) {
    y_hat[t,1] <- y_hat[t,1] + normal_rng(0,sigma[1]);
    y_hat[t,2] <- y_hat[t,2] + normal_rng(0,sigma[2]);
  }
}
And here’s the R to invoke it
library(rstan)
library(mvtnorm)
qc1 = 0.0001
deltaT = 0.01
nSamples = 100
m0 = c(1.6, 0)
g = 9.81
t0 = 0.0
ts = seq(deltaT,nSamples * deltaT,deltaT)
bigQ = matrix(c(qc1 * deltaT^3 / 3, qc1 * deltaT^2 / 2,
qc1 * deltaT^2 / 2, qc1 * deltaT
),
nrow = 2,
ncol = 2,
byrow = TRUE
)
samples <- stan(file = 'Pendulum.stan',
                data = list (
                  T = nSamples,
                  y0 = m0,
                  t0 = t0,
                  ts = ts,
                  theta = array(g, dim = 1),
                  sigma = c(bigQ[1,1], bigQ[2,2])
                ),
                algorithm = "Fixed_param",
                seed = 42,
                chains = 1,
                iter = 1,
                refresh = -1
)
We can plot the angle the pendulum subtends to the vertical over time. Note that this is not very noisy.
s <- extract(samples,permuted=FALSE)
plot(s[1,1,1:100])
Now let us suppose that we don’t know the value of $g$ and we can only observe the horizontal displacement.
zStan <- sin(s[1,1,1:nSamples])
Now we can use Stan to infer the value of $g$.
functions {
  real[] pendulum(real t,
                  real[] y,
                  real[] theta,
                  real[] x_r,
                  int[] x_i) {
    real dydt[2];
    dydt[1] <- y[2];
    dydt[2] <- - theta[1] * sin(y[1]);
    return dydt;
  }
}
data {
  int<lower=1> T;
  real y0[2];
  real z[T];
  real t0;
  real ts[T];
}
transformed data {
  real x_r[0];
  int x_i[0];
}
parameters {
  real theta[1];
  vector<lower=0>[1] sigma;
}
model {
  real y_hat[T,2];
  real z_hat[T];
  theta ~ normal(0,1);
  sigma ~ cauchy(0,2.5);
  y_hat <- integrate_ode(pendulum, y0, t0, ts, theta, x_r, x_i);
  for (t in 1:T) {
    z_hat[t] <- sin(y_hat[t,1]);
    z[t] ~ normal(z_hat[t], sigma);
  }
}
Here’s the R to invoke it.
estimates <- stan(file = 'PendulumInfer.stan',
data = list (
T = nSamples,
y0 = m0,
z = zStan,
t0 = t0,
ts = ts
),
seed = 42,
chains = 1,
iter = 1000,
warmup = 500,
refresh = -1
)
e <- extract(estimates,pars=c("theta[1]","sigma[1]","lp__"),permuted=TRUE)
This gives an estimated value for $g$ of 9.809999 which is what we would hope.
Now let’s try adding some noise to our observations.
set.seed(42)
# bigR, the 1x1 observation noise covariance matrix, must already be defined at this point
epsilons <- rmvnorm(n=nSamples,mean=c(0.0),sigma=bigR)
zStanNoisy <- sin(s[1,1,1:nSamples] + epsilons[,1])
estimatesNoisy <- stan(file = 'PendulumInfer.stan',
data = list (T = nSamples,
y0 = m0,
z = zStanNoisy,
t0 = t0,
ts = ts
),
seed = 42,
chains = 1,
iter = 1000,
warmup = 500,
refresh = -1
)
eNoisy <- extract(estimatesNoisy,pars=c("theta[1]","sigma[1]","lp__"),permuted=TRUE)
This gives an estimated value for $g$ of 8.5871024 which is ok but not great.
To build this page, download the relevant files from github and run this in R:
library(knitr)
knit('Pendulum.Rmd')
And this from the command line:
pandoc -s Pendulum.md --filter=./Include > PendulumExpanded.html
Let be a (hidden) Markov process. By hidden, we mean that we are not able to observe it.
And let be an observable Markov process such that
That is the observations are conditionally independent given the state of the hidden process.
As an example let us take the one given in Särkkä (2013) where the movement of a car is given by Newton’s laws of motion and the acceleration is modelled as white noise.
Although we do not do so here, and can be derived from the dynamics. For the purpose of this blog post, we note that they are given by
and
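For reference, the model and its two matrices can be written out as follows. The entries are read off from the Haskell definitions `bigAl` and `bigQl` in this post, with the state ordered as position followed by velocity in each of the two dimensions; the symbols $q_{i-1}$, $r_i$, $q_1^c$ and $q_2^c$ are assumed notation.

```latex
\begin{aligned}
x_i &= A\, x_{i-1} + q_{i-1}, \qquad q_{i-1} \sim \mathcal{N}(0, Q), \\
y_i &= H\, x_i + r_i, \qquad r_i \sim \mathcal{N}(0, R), \\
A &=
\begin{bmatrix}
1 & 0 & \Delta t & 0 \\
0 & 1 & 0 & \Delta t \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix},
\qquad
Q =
\begin{bmatrix}
\frac{q_1^c \Delta t^3}{3} & 0 & \frac{q_1^c \Delta t^2}{2} & 0 \\
0 & \frac{q_2^c \Delta t^3}{3} & 0 & \frac{q_2^c \Delta t^2}{2} \\
\frac{q_1^c \Delta t^2}{2} & 0 & q_1^c \Delta t & 0 \\
0 & \frac{q_2^c \Delta t^2}{2} & 0 & q_2^c \Delta t
\end{bmatrix}.
\end{aligned}
```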
We wish to determine the position and velocity of the car given noisy observations of the position. In general we need the distribution of the hidden path given the observable path. We use the notation $x_{m:n}$ to mean the path of $x$ starting at $m$ and finishing at $n$.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE FlexibleInstances #-}
> {-# LANGUAGE MultiParamTypeClasses #-}
> {-# LANGUAGE FlexibleContexts #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE BangPatterns #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE TemplateHaskell #-}
> module ParticleSmoothing
> ( simpleSamples
> , carSamples
> , testCar
> , testSimple
> ) where
> import Data.Random.Source.PureMT
> import Data.Random hiding ( StdNormal, Normal )
> import qualified Data.Random as R
> import Control.Monad.State
> import Control.Monad.Writer hiding ( Any, All )
> import qualified Numeric.LinearAlgebra.HMatrix as H
> import Foreign.Storable ( Storable )
> import Data.Maybe ( fromJust )
> import Data.Bits ( shiftR )
> import qualified Data.Vector as V
> import qualified Data.Vector.Unboxed as U
> import Control.Monad.ST
> import System.Random.MWC
> import Data.Array.Repa ( Z(..), (:.)(..), Any(..), computeP,
> extent, DIM1, DIM2, slice, All(..)
> )
> import qualified Data.Array.Repa as Repa
> import qualified Control.Monad.Loops as ML
> import PrettyPrint ()
> import Text.PrettyPrint.HughesPJClass ( Pretty, pPrint )
> import Data.Vector.Unboxed.Deriving
If we could sample then we could approximate the posterior as
If we wish to, we can create marginal estimates
When , this is the filtering estimate.
Prediction
Update
where by definition
We have
where by definition
Prediction
Update
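For the marginal (filtering) distribution, the two steps are the standard Bayesian filtering recursion, written here in assumed notation with $x_n$ the hidden state and $y_{1:n}$ the observations up to time $n$.

```latex
\begin{aligned}
\text{Prediction:} \quad
p(x_n \mid y_{1:n-1})
  &= \int p(x_n \mid x_{n-1})\, p(x_{n-1} \mid y_{1:n-1})\, dx_{n-1}, \\
\text{Update:} \quad
p(x_n \mid y_{1:n})
  &= \frac{p(y_n \mid x_n)\, p(x_n \mid y_{1:n-1})}{p(y_n \mid y_{1:n-1})}, \\
\text{where} \quad
p(y_n \mid y_{1:n-1})
  &= \int p(y_n \mid x_n)\, p(x_n \mid y_{1:n-1})\, dx_n .
\end{aligned}
```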
The idea is to simulate paths using the recursion we derived above.
At time we have an approximating distribution
Sample and set . We then have an approximation of the prediction step
Substituting
and again
where and .
Now sample
Let’s specify some values for the example of the car moving in two dimensions.
> deltaT, sigma1, sigma2, qc1, qc2 :: Double
> deltaT = 0.1
> sigma1 = 1/2
> sigma2 = 1/2
> qc1 = 1
> qc2 = 1
> bigA :: H.Matrix Double
> bigA = (4 H.>< 4) bigAl
> bigAl :: [Double]
> bigAl = [1, 0 , deltaT, 0,
> 0, 1, 0, deltaT,
> 0, 0, 1, 0,
> 0, 0, 0, 1]
> bigQ :: H.Herm Double
> bigQ = H.trustSym $ (4 H.>< 4) bigQl
> bigQl :: [Double]
> bigQl = [qc1 * deltaT^3 / 3, 0, qc1 * deltaT^2 / 2, 0,
> 0, qc2 * deltaT^3 / 3, 0, qc2 * deltaT^2 / 2,
> qc1 * deltaT^2 / 2, 0, qc1 * deltaT, 0,
> 0, qc2 * deltaT^2 / 2, 0, qc2 * deltaT]
> bigH :: H.Matrix Double
> bigH = (2 H.>< 4) [1, 0, 0, 0,
> 0, 1, 0, 0]
> bigR :: H.Herm Double
> bigR = H.trustSym $ (2 H.>< 2) [sigma1^2, 0,
> 0, sigma2^2]
> m0 :: H.Vector Double
> m0 = H.fromList [0, 0, 1, -1]
> bigP0 :: H.Herm Double
> bigP0 = H.trustSym $ H.ident 4
> n :: Int
> n = 23
With these we generate hidden and observable sample paths.
> carSample :: MonadRandom m =>
> H.Vector Double ->
> m (Maybe ((H.Vector Double, H.Vector Double), H.Vector Double))
> carSample xPrev = do
> xNew <- sample $ rvar (Normal (bigA H.#> xPrev) bigQ)
> yNew <- sample $ rvar (Normal (bigH H.#> xNew) bigR)
> return $ Just ((xNew, yNew), xNew)
> carSamples :: [(H.Vector Double, H.Vector Double)]
> carSamples = evalState (ML.unfoldrM carSample m0) (pureMT 17)
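Stripped of the noise, each step of `carSample` just moves the car along its current velocity. A minimal sketch of that deterministic part, using a plain tuple rather than an hmatrix vector, is below; `Car` and `stepCar` are hypothetical names used only for illustration.

```haskell
-- | (x position, y position, x speed, y speed).
type Car = (Double, Double, Double, Double)

-- | The noise-free part of the transition: positions advance by
-- speed times the time step, and speeds are unchanged.
stepCar :: Double -> Car -> Car
stepCar dt (x, y, vx, vy) = (x + dt * vx, y + dt * vy, vx, vy)
```

This is the same map as multiplying the state by `bigA`: `stepCar 0.1 (0, 0, 1, -1)` moves the car from the origin to `(0.1, -0.1)` while leaving the speeds at `1` and `-1`.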
We can plot an example trajectory for the car and the noisy observations that are available to the smoother / filter.
Sadly there is no equivalent to numpy in Haskell. Random number packages generate vectors; for multi-rank arrays there is repa; and for fast matrix manipulation there is hmatrix. Thus for our single step path update function, we have to pass in functions to perform type conversion. Clearly, with all the copying inherent in this approach, performance is not going to be great.
The type synonym ArraySmoothing is used to denote the cloud of particles at each time step.
> type ArraySmoothing = Repa.Array Repa.U DIM2
> singleStep :: forall a . U.Unbox a =>
> (a -> H.Vector Double) ->
> (H.Vector Double -> a) ->
> H.Matrix Double ->
> H.Herm Double ->
> H.Matrix Double ->
> H.Herm Double ->
> ArraySmoothing a -> H.Vector Double ->
> WriterT [ArraySmoothing a] (StateT PureMT IO) (ArraySmoothing a)
> singleStep f g bigA bigQ bigH bigR x y = do
> tell[x]
> let (Z :. ix :. jx) = extent x
>
> xHatR <- lift $ computeP $ Repa.slice x (Any :. jx - 1)
> let xHatH = map f $ Repa.toList (xHatR :: Repa.Array Repa.U DIM1 a)
> xTildeNextH <- lift $ mapM (\x -> sample $ rvar (Normal (bigA H.#> x) bigQ)) xHatH
>
> let xTildeNextR = Repa.fromListUnboxed (Z :. ix :. (1 :: Int)) $
> map g xTildeNextH
> xTilde = Repa.append x xTildeNextR
>
> weights = map (normalPdf y bigR) $
> map (bigH H.#>) xTildeNextH
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
> cumSumWeights = V.scanl (+) 0 (V.fromList weights)
> totWeight = sum weights
> js = indices (V.map (/ totWeight) $ V.tail cumSumWeights) vs
> xNewV = V.map (\j -> Repa.transpose $
> Repa.reshape (Z :. (1 :: Int) :. jx + 1) $
> slice xTilde (Any :. j :. All)) js
> xNewR = Repa.transpose $ V.foldr Repa.append (xNewV V.! 0) (V.tail xNewV)
> computeP xNewR
The state for the car is a 4-tuple.
> data SystemState a = SystemState { xPos :: a
> , yPos :: a
> , _xSpd :: a
> , _ySpd :: a
> }
We initialize the smoother from some prior distribution.
> initCar :: StateT PureMT IO (ArraySmoothing (SystemState Double))
> initCar = do
>   xTilde1 <- replicateM n $ sample $ rvar (Normal m0 bigP0)
>   let weights = map (normalPdf (snd $ head carSamples) bigR) $
>                 map (bigH H.#>) xTilde1
>       totWeight = sum weights
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       -- Normalise so that the cumulative weights end at 1, as in singleStep
>       cumSumWeights = V.scanl (+) 0 (V.fromList $ map (/ totWeight) weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $
>               map ((\[a,b,c,d] -> SystemState a b c d) . H.toList) $
>               V.toList $
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1
Now we can run the smoother.
> smootherCar :: StateT PureMT IO
> (ArraySmoothing (SystemState Double)
> , [ArraySmoothing (SystemState Double)])
> smootherCar = runWriterT $ do
> xHat1 <- lift initCar
> foldM (singleStep f g bigA bigQ bigH bigR) xHat1 (take 100 $ map snd $ tail carSamples)
> f :: SystemState Double -> H.Vector Double
> f (SystemState a b c d) = H.fromList [a, b, c, d]
> g :: H.Vector Double -> SystemState Double
> g = (\[a,b,c,d] -> (SystemState a b c d)) . H.toList
And create inferred positions for the car which we then plot.
> testCar :: IO ([Double], [Double])
> testCar = do
> states <- snd <$> evalStateT smootherCar (pureMT 24)
> let xs :: [Repa.Array Repa.D DIM2 Double]
> xs = map (Repa.map xPos) states
> sumXs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose xs)
> let ixs = map extent sumXs
> sumLastXs = map (* (recip $ fromIntegral n)) $
> zipWith (Repa.!) sumXs (map (\(Z :. x) -> Z :. (x - 1)) ixs)
> let ys :: [Repa.Array Repa.D DIM2 Double]
> ys = map (Repa.map yPos) states
> sumYs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose ys)
> let ixsY = map extent sumYs
> sumLastYs = map (* (recip $ fromIntegral n)) $
> zipWith (Repa.!) sumYs (map (\(Z :. x) -> Z :. (x - 1)) ixsY)
> return (sumLastXs, sumLastYs)
So it seems our smoother does quite well at inferring the state at the latest observation, that is, when it is working as a filter. But what about estimates for earlier times? We should do better as we have observations in the past and in the future. Let’s try with a simpler example and a smaller number of particles.
First we create some samples for our simple 1 dimensional linear Gaussian model.
> bigA1, bigQ1, bigR1, bigH1 :: Double
> bigA1 = 0.5
> bigQ1 = 0.1
> bigR1 = 0.1
> bigH1 = 1.0
> simpleSample :: MonadRandom m =>
> Double ->
> m (Maybe ((Double, Double), Double))
> simpleSample xPrev = do
> xNew <- sample $ rvar (R.Normal (bigA1 * xPrev) bigQ1)
> yNew <- sample $ rvar (R.Normal (bigH1 * xNew) bigR1)
> return $ Just ((xNew, yNew), xNew)
> simpleSamples :: [(Double, Double)]
> simpleSamples = evalState (ML.unfoldrM simpleSample 0.0) (pureMT 17)
Again create a prior.
> initSimple :: MonadRandom m => m (ArraySmoothing Double)
> initSimple = do
> let y = snd $ head simpleSamples
> xTilde1 <- replicateM n $ sample $ rvar $ R.Normal y bigR1
> let weights = map (pdf (R.Normal y bigR1)) $
> map (bigH1 *) xTilde1
> totWeight = sum weights
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
> cumSumWeights = V.scanl (+) 0 (V.fromList $ map (/ totWeight) weights)
> js = indices (V.tail cumSumWeights) vs
> xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $
> V.toList $
> V.map ((V.fromList xTilde1) V.!) js
> return xHat1
Now we can run the smoother.
> smootherSimple :: StateT PureMT IO (ArraySmoothing Double, [ArraySmoothing Double])
> smootherSimple = runWriterT $ do
> xHat1 <- lift initSimple
> foldM (singleStep f1 g1 ((1 H.>< 1) [bigA1]) (H.trustSym $ (1 H.>< 1) [bigQ1^2])
> ((1 H.>< 1) [bigH1]) (H.trustSym $ (1 H.>< 1) [bigR1^2]))
> xHat1
> (take 20 $ map H.fromList $ map return . map snd $ tail simpleSamples)
> f1 :: Double -> H.Vector Double
> f1 a = H.fromList [a]
> g1 :: H.Vector Double -> Double
> g1 = (\[a] -> a) . H.toList
And finally we can look at the paths, not just the means of the marginal distributions at the latest observation time.
> testSimple :: IO [[Double]]
> testSimple = do
> states <- snd <$> evalStateT smootherSimple (pureMT 24)
> let path :: Int -> IO (Repa.Array Repa.U DIM1 Double)
> path i = computeP $ Repa.slice (last states) (Any :. i :. All)
> paths <- mapM path [0..n - 1]
> return $ map Repa.toList paths
We can see that at some point in the past all the current particles have one ancestor. The marginals of the smoothing distribution (at some point in the past) have collapsed on to one particle.
The following are helpers for sampling via the inverse CDF; the explanation of why they work is deferred to another blog post.
> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs
> binarySearch :: Ord a =>
> V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
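To make the role of these helpers concrete, here is a minimal, base-only sketch of the same inverse-CDF resampling idea. The names `cumulative` and `resampleIndices` are hypothetical, and a linear scan stands in for the binary search above, so this is an illustration rather than the code used in the smoother.

```haskell
import Data.List (findIndex)
import Data.Maybe (fromMaybe)

-- | Normalised cumulative weights: the empirical CDF over particle indices.
cumulative :: [Double] -> [Double]
cumulative ws = map (/ total) (drop 1 (scanl (+) 0 ws))
  where total = sum ws

-- | For each uniform draw u in [0, 1], pick the first index whose cumulative
-- weight is at least u. Heavier particles own a longer stretch of [0, 1]
-- and are therefore selected more often.
resampleIndices :: [Double] -> [Double] -> [Int]
resampleIndices ws = map pick
  where
    cdf    = cumulative ws
    lastIx = length ws - 1
    pick u = fromMaybe lastIx (findIndex (>= u) cdf)
```

For example, with weights `[1, 1, 2]` the cumulative weights are `[0.25, 0.5, 1.0]`, so a draw of 0.9 selects the third particle.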
The random-fu package does not contain a sampler or pdf for a multivariate normal so we create our own. This should be added to a random-fu-multivariate package or something similar.
> normalMultivariate :: H.Vector Double -> H.Herm Double -> RVarT m (H.Vector Double)
> normalMultivariate mu bigSigma = do
> z <- replicateM (H.size mu) (rvarT R.StdNormal)
> return $ mu + bigA H.#> (H.fromList z)
> where
> (vals, bigU) = H.eigSH bigSigma
> lSqrt = H.diag $ H.cmap sqrt vals
> bigA = bigU H.<> lSqrt
> data family Normal k :: *
> data instance Normal (H.Vector Double) = Normal (H.Vector Double) (H.Herm Double)
> instance Distribution Normal (H.Vector Double) where
> rvar (Normal m s) = normalMultivariate m s
> normalPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
> H.Vector a -> H.Herm a -> H.Vector a -> a
> normalPdf mu sigma x = exp $ normalLogPdf mu sigma x
> normalLogPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
> H.Vector a -> H.Herm a -> H.Vector a -> a
> normalLogPdf mu bigSigma x = - H.sumElements (H.cmap log (diagonals dec))
> - 0.5 * (fromIntegral (H.size mu)) * log (2 * pi)
> - 0.5 * s
> where
> dec = fromJust $ H.mbChol bigSigma
> t = fromJust $ H.linearSolve (H.tr dec) (H.asColumn $ x - mu)
> u = H.cmap (\x -> x * x) t
> s = H.sumElements u
> diagonals :: (Storable a, H.Element t, H.Indexable (H.Vector t) a) =>
> H.Matrix t -> H.Vector a
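> diagonals m = H.fromList (map (\i -> m H.! i H.! i) [0..n-1])
>   where
>     -- The diagonal has min(rows, cols) entries; max would index out of bounds
>     -- for a non-square matrix.
>     n = min (H.rows m) (H.cols m)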
> instance PDF Normal (H.Vector Double) where
> pdf (Normal m s) = normalPdf m s
> logPdf (Normal m s) = normalLogPdf m s
> derivingUnbox "SystemState"
> [t| forall a . (U.Unbox a) => SystemState a -> (a, a, a, a) |]
> [| \ (SystemState x y xdot ydot) -> (x, y, xdot, ydot) |]
> [| \ (x, y, xdot, ydot) -> SystemState x y xdot ydot |]
> instance Pretty a => Pretty (SystemState a) where
> pPrint (SystemState x y xdot ydot ) = pPrint (x, y, xdot, ydot)
Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press.
Here’s an example of why floating point might really be the best option for numerical calculations.
Suppose you wish to find the roots of a quintic equation.
> import Numeric.AD
> import Data.List
> import Data.Ratio
> p :: Num a => a -> a
> p x = x^5 - 2*x^4 - 3*x^3 + 3*x^2 - 2*x - 1
We can do so with Newton-Raphson, using automatic differentiation to calculate the derivative (even though for polynomials this is trivial).
> nr :: Fractional a => [a]
> nr = unfoldr g 0
>   where
>     g z = let u = z - (p z) / (h z) in Just (u, u)
>     h z = let [y] = grad (\[x] -> p x) [z] in y
After 7 iterations we see the size of the denominator is quite large (33308 digits) and the calculation takes many seconds.
ghci> length $ show $ denominator (nr!!7)
33308
On the other hand if we use floating point we get an answer accurate to 1 in after 7 iterations very quickly.
ghci> mapM_ putStrLn $ map show $ take 7 nr
-0.5
-0.3368421052631579
-0.31572844839628944
-0.31530116270327685
-0.31530098645936266
-0.3153009864593327
-0.3153009864593327
The example is taken from here who refers the reader to Nick Higham’s book: Accuracy and Stability of Numerical Algorithms.
Of course we should check we found a right answer.
ghci> p $ nr!!6
0.0
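The contrast between exact and floating point arithmetic can also be reproduced without the ad package. The following is a sketch under the assumption that we write the derivative by hand; the names `poly`, `poly'`, `newton` and `denominatorDigits` are illustrative and not part of the original code.

```haskell
import Data.Ratio (denominator)

-- The quintic from the text and its hand-written derivative.
poly, poly' :: Num a => a -> a
poly  x = x^5 - 2*x^4 - 3*x^3 + 3*x^2 - 2*x - 1
poly' x = 5*x^4 - 8*x^3 - 9*x^2 + 6*x - 2

-- | Newton-Raphson iterates starting from 0 (the start value itself is dropped).
newton :: Fractional a => [a]
newton = drop 1 (iterate step 0)
  where step z = z - poly z / poly' z

-- | Number of decimal digits in the denominator of the k-th exact iterate.
denominatorDigits :: Int -> Int
denominatorDigits k = length (show (denominator (newton !! k :: Rational)))
```

Evaluating `map denominatorDigits [0..5]` shows how quickly the exact representation grows, whereas `take 7 newton :: [Double]` stays at a fixed size per iterate.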
We previously used importance sampling in the case where we did not have a sampler available for the distribution from which we wished to sample. There is an even more compelling case for using importance sampling.
Suppose we wish to estimate the probability of a rare event. For example, suppose we wish to estimate $\mathbb{P}(X > 5)$ where $X \sim \mathcal{N}(0,1)$. In this case, we can look up the answer $\mathbb{P}(X > 5) \approx 2.87 \times 10^{-7}$. But suppose we couldn’t look up the answer. One strategy that might occur to us is to sample $X$ repeatedly and then estimate the probability by counting the number of times out of the total that the sample was bigger than 5. The flaw in this is obvious but let’s try it anyway.
> module Girsanov where
> import qualified Data.Vector as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.Number.Erf
> import Data.List ( transpose )
> samples :: (Foldable f, MonadRandom m) =>
> (Int -> RVar Double -> RVar (f Double)) ->
> Int ->
> m (f Double)
> samples repM n = sample $ repM n $ stdNormal
> biggerThan5 :: Int
> biggerThan5 = length (evalState xs (pureMT 42))
> where
> xs :: MonadRandom m => m [Double]
> xs = liftM (filter (>= 5.0)) $ samples replicateM 100000
As we might have expected, even if we draw 100,000 samples, we estimate this probability quite poorly.
ghci> biggerThan5
0
Using importance sampling we can do a lot better.
Let be a normally distributed random variable with zero mean and unit variance under the Lebesgue measure . As usual we can then define a new probability measure, the law of , by
Thus
where we have defined
Thus we can estimate this probability either by sampling from a normal distribution with mean 0 and counting the number of samples that are above 5, or by sampling from a normal distribution with mean 5 and calculating the appropriately weighted mean.
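As a worked check of the weight `g` used in `biggerThan5'`, here is a sketch of the change of measure. The symbols $\varphi$ (the standard normal density), $Y$ and $\mathbb{Q}$ are notation assumed for this illustration.

$$
\mathbb{P}(X > 5)
= \int \mathbb{1}_{\{x > 5\}}\,\varphi(x)\,\mathrm{d}x
= \int \mathbb{1}_{\{x > 5\}}\,\frac{\varphi(x)}{\varphi(x - 5)}\,\varphi(x - 5)\,\mathrm{d}x
= \mathbb{E}_{\mathbb{Q}}\!\left[\mathbb{1}_{\{Y > 5\}}\,e^{25/2 - 5Y}\right]
$$

where $Y \sim \mathcal{N}(5, 1)$ under $\mathbb{Q}$, because

$$
\frac{\varphi(x)}{\varphi(x - 5)} = \exp\!\left(-\frac{x^2}{2} + \frac{(x - 5)^2}{2}\right) = \exp\!\left(\frac{5^2}{2} - 5x\right).
$$

This is the expression `exp $ (5^2 / 2) - 5 * x` applied to the shifted samples.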
Let’s try this out.
> biggerThan5' :: Double
> biggerThan5' = sum (evalState xs (pureMT 42)) / (fromIntegral n)
> where
> xs :: MonadRandom m => m [Double]
> xs = liftM (map g) $
> liftM (filter (>= 5.0)) $
> liftM (map (+5)) $
> samples replicateM n
> g x = exp $ (5^2 / 2) - 5 * x
> n = 100000
And now we get quite a good estimate.
ghci> biggerThan5'
2.85776225450217e-7
The probability of another rare event we might wish to estimate is that of Brownian Motion crossing a boundary. For example, what is the probability of Brownian Motion crossing the line ? Let’s try sampling 100 paths (we restrict the number so the chart is still readable).
> epsilons :: (Foldable f, MonadRandom m) =>
> (Int -> RVar Double -> RVar (f Double)) ->
> Double ->
> Int ->
> m (f Double)
> epsilons repM deltaT n = sample $ repM n $ rvar (Normal 0.0 (sqrt deltaT))
> bM0to1 :: Foldable f =>
> ((Double -> Double -> Double) -> Double -> f Double -> f Double)
> -> (Int -> RVar Double -> RVar (f Double))
> -> Int
> -> Int
> -> f Double
> bM0to1 scan repM seed n =
> scan (+) 0.0 $
> evalState (epsilons repM (recip $ fromIntegral n) n) (pureMT (fromIntegral seed))
We can see by eye in the chart below that again we do quite poorly.
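We know that $\mathbb{P}(T_a \le t) = 2\,(1 - \Phi(a / \sqrt{t}))$ where $T_a = \inf \{s : W_s = a\}$ and $\Phi$ is the standard normal cumulative distribution function.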
> p :: Double -> Double -> Double
> p a t = 2 * (1 - normcdf (a / sqrt t))
ghci> p 1.0 1.0
0.31731050786291415
ghci> p 2.0 1.0
4.550026389635842e-2
ghci> p 3.0 1.0
2.699796063260207e-3
But what if we didn’t know this formula? Define
where is the measure which makes Brownian Motion.
We can estimate the expectation of
where is 1 if Brownian Motion hits the barrier and 0 otherwise and M is the total number of simulations. We know from visual inspection that this gives poor results but let us try some calculations anyway.
> n = 500
> m = 10000
> supAbove :: Double -> Double
> supAbove a = fromIntegral count / fromIntegral n
> where
> count = length $
> filter (>= a) $
> map (\seed -> maximum $ bM0to1 scanl replicateM seed m) [0..n - 1]
> bM0to1WithDrift seed mu n =
> zipWith (\m x -> x + mu * m * deltaT) [0..] $
> bM0to1 scanl replicateM seed n
> where
> deltaT = recip $ fromIntegral n
ghci> supAbove 1.0
0.326
ghci> supAbove 2.0
7.0e-2
ghci> supAbove 3.0
0.0
As expected for a rare event we get an estimate of 0.
Fortunately we can use importance sampling for paths. If we take where is a constant in Girsanov’s Theorem below then we know that is -Brownian Motion.
We observe that
So we can also estimate the expectation of under as
where is now 1 if Brownian Motion with the specified drift hits the barrier and 0 otherwise, and is Brownian Motion sampled at .
We can see from the chart below that this is going to be better at hitting the required barrier.
Let’s do some calculations.
> supAbove' a = (sum $ zipWith (*) ns ws) / fromIntegral n
> where
> deltaT = recip $ fromIntegral m
>
> uss = map (\seed -> bM0to1 scanl replicateM seed m) [0..n - 1]
> ys = map last uss
> ws = map (\x -> exp (-a * x - 0.5 * a^2)) ys
>
> vss = map (zipWith (\m x -> x + a * m * deltaT) [0..]) uss
> sups = map maximum vss
> ns = map fromIntegral $ map fromEnum $ map (>=a) sups
ghci> supAbove' 1.0
0.31592655955519156
ghci> supAbove' 2.0
4.999395029856741e-2
ghci> supAbove' 3.0
2.3859203473651654e-3
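Read as a formula, `supAbove'` computes the following weighted average. This is a transcription of the code rather than a statement taken from the text; $N$ stands for the number of paths `n`, $t_k = k/m$ for the sampling times, and $W^{(i)}$ for the $i$-th driftless sample path.

$$
\hat{p}(a) = \frac{1}{N} \sum_{i=1}^{N}
\mathbb{1}\!\left\{\max_{0 \le k \le m} \left(W^{(i)}_{t_k} + a\,t_k\right) \ge a\right\}
\exp\!\left(-a\,W^{(i)}_{1} - \tfrac{1}{2}a^{2}\right)
$$

The indicator is evaluated on the path with drift (`vss`), while the weight uses the terminal value of the driftless path (`ys`).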
The reader is invited to try the above estimates with 1,000 samples per path to see that even with this respectable number, the calculation goes awry.
If we have a probability space and a non-negative random variable with then we can define a new probability measure on the same -algebra by
For any two probability measures when such a exists, it is called the Radon-Nikodym derivative of with respect to and denoted
Given that we managed to shift a Normal Distribution with non-zero mean in one measure to a Normal Distribution with another mean in another measure by producing the Radon-Nikodym derivative, might it be possible to shift Brownian Motion with a drift under one probability measure to be pure Brownian Motion under another probability measure by producing the Radon-Nikodym derivative? The answer is yes, as Girsanov’s theorem below shows.
Let be Brownian Motion on a probability space and let be a filtration for this Brownian Motion and let be an adapted process such that the Novikov Sufficiency Condition holds
then there exists a probability measure such that
is equivalent to , that is, .
.
is Brownian Motion on the probability space also with the filtration .
In order to prove Girsanov’s Theorem, we need a condition which allows us to infer that is a strict martingale. One such useful condition to which we have already alluded is the Novikov Sufficiency Condition.
Define by
Then, temporarily overloading the notation and writing for expectation under , and applying the Novikov Sufficiency Condition to , we have
From which we see that
And since this characterizes Brownian Motion, we are done.
Let and further let it satisfy the Novikov condition
then the process defined by
is a strict martingale.
Before we prove this, we need two lemmas.
Let for be a non-negative local martingale then is a super-martingale and if further then is a strict martingale.
Proof
Let be a localizing sequence for then for and using Fatou’s lemma and the fact that the stopped process is a strict martingale
Thus is a super-martingale and therefore
By assumption we have thus is a strict martingale.
Let be a non-negative local martingale. If is a localizing sequence such that for some then is a strict martingale.
Proof
By the super-martingale property and thus by dominated convergence we have that
We also have that
By Chebyshev’s inequality (see note (2) below), we have
Taking limits first over and then over we see that
For and we have . Thus
Again taking limits over and then over we have
These two conclusions imply
We can therefore conclude (since is a martingale)
Thus by the preceding lemma is a strict as well as a local martingale.
First we note that is a local martingale for . Let us show that it is a strict martingale. We can do this if for any localizing sequence we can show
using the preceding lemma where .
We note that
Now apply Hölder’s inequality with conjugates and where is chosen to ensure that the conjugates are both strictly greater than 1 (otherwise we cannot apply the inequality).
Now let us choose
then
In order to apply Hölder’s inequality we need to check that and that but this amounts to checking that and that . We also need to check that but this amounts to checking that for and this is easily checked to be true.
Re-writing the above inequality with this value of we have
By the first lemma, since is a non-negative local martingale, it is also a supermartingale. Furthermore . Thus
and therefore
Recall we have
Taking logs gives
or in differential form
We can also apply Itô’s rule to
where denotes the quadratic variation of a stochastic process.
Comparing terms gives the stochastic differential equation
In integral form this can also be written as
Thus is a local martingale (it is defined by a stochastic differential equation) and by the first lemma it is a supermartingale. Hence .
Next we note that
to which we can apply Hölder’s inequality with conjugates to obtain
Applying the supermartingale inequality then gives
Now we can apply the result in Step 2 to the result in Step 1.
We can replace by for any stopping time . Thus for a localizing sequence we have
From which we can conclude
Now we can apply the second lemma to conclude that is a strict martingale.
We have already calculated that
Now apply Hölder’s inequality with conjugates and .
And then we can apply Jensen’s inequality to the last term on the right hand side with the convex function .
Using the inequality we established in Step 2 and the Novikov condition then gives
If we now let we see that we must have . We already know that by the first lemma and so we have finally proved that is a martingale.
We have already used importance sampling and also touched on changes of measure.
Chebyshev’s inequality is usually stated for the second moment but the proof is easily adapted:
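A sketch of the first-moment version, under the assumption that this is the form needed in the proof above (the symbols $X$ and $c$ are introduced here for illustration): for a non-negative random variable $X$ and a constant $c > 0$ we have $c\,\mathbb{1}_{\{X \ge c\}} \le X$ pointwise, so taking expectations gives

$$
c\,\mathbb{P}(X \ge c) = \mathbb{E}\!\left[c\,\mathbb{1}_{\{X \ge c\}}\right] \le \mathbb{E}[X]
\quad\Longrightarrow\quad
\mathbb{P}(X \ge c) \le \frac{\mathbb{E}[X]}{c}.
$$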
Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”
Steele, J.M. 2001. Stochastic Calculus and Financial Applications. Applications of Mathematics. Springer New York. https://books.google.co.uk/books?id=fsgkBAAAQBAJ.
Suppose we wish to model a process described by a differential equation and initial condition
But we wish to do this in the presence of noise. It’s not clear how to do this but maybe we can model the process discretely, add noise and somehow take limits.
Let be a partition of then we can discretise the above, allow the state to be random and add in some noise which we model as samples of Brownian motion at the selected times multiplied by so that we can vary the amount of noise depending on the state. We change the notation from to to indicate that the variable is now random over some probability space.
We can suppress explicit mention of and use subscripts to avoid clutter.
We can make this depend continuously on time specifying that
and then telescoping to obtain
In the limit, the second term on the right looks like an ordinary integral with respect to time, albeit the integrand is stochastic, but what are we to make of the third term? We know that Brownian motion is nowhere differentiable so it would seem the task is impossible. However, let us see what progress we can make with so-called simple processes.
Let
where is -measurable. We call such a process simple. We can then define
So if we can produce a sequence of simple processes that converge in some norm to then we can define
Of course we need to put some conditions on the particular class of stochastic processes for which this is possible and check that the limit exists and is unique.
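For a simple process the integral is just a finite sum, which can be sketched in a few lines. The function names below are hypothetical, and the path is supplied as a list of sampled values rather than generated randomly, so this only illustrates the definition.

```haskell
-- | Stochastic integral of a simple process against a sampled path:
-- the sum of xi_i * (W_{t_{i+1}} - W_{t_i}), where xi_i is the value the
-- integrand takes on [t_i, t_{i+1}) and is known at time t_i.
simpleIntegral :: [Double] -> [Double] -> Double
simpleIntegral xis ws = sum (zipWith (*) xis increments)
  where increments = zipWith (-) (drop 1 ws) ws

-- | Integrate the path against itself, freezing the integrand at the left
-- endpoint of each interval so that it never looks into the future.
leftPointSelfIntegral :: [Double] -> Double
leftPointSelfIntegral ws = simpleIntegral ws ws
```

Freezing the integrand at the left endpoint is what keeps each coefficient measurable with respect to the information available at the start of its interval. For any path starting at 0, the left-point sum equals half of the squared final value minus half of the sum of squared increments, a purely algebraic identity.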
We consider the , the space of square integrable functions with respect to the product measure where is Lebesgue measure on and is some given probability measure. We further restrict ourselves to progressively measurable functions. More explicitly, we consider the latter class of stochastic processes such that
Let be a bounded, almost surely continuous and progressively measurable process which is (almost surely) for for some positive constant . Define
These processes are clearly progressively measurable and by bounded convergence ( is bounded by hypothesis and is uniformly bounded by the same bound).
Let be a bounded and progressively measurable process which is (almost surely) for for some positive constant . Define
Then is bounded, continuous and progressively measurable and it is well known that as . Again by bounded convergence
Firstly, let be a progressively measurable process which is (almost surely) for for some positive constant . Define . Then is bounded and by dominated convergence
Finally let be a progressively measurable process. Define
Clearly
Let be a simple process such that
then
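For a simple process the isometry can be checked directly. The following is a sketch in assumed notation: write $X_t = \sum_i \xi_i\,\mathbb{1}_{[t_i, t_{i+1})}(t)$ with each $\xi_i$ being $\mathcal{F}_{t_i}$-measurable and square integrable, and $\Delta W_i = W_{t_{i+1}} - W_{t_i}$.

$$
\mathbb{E}\!\left[\Big(\sum_i \xi_i\,\Delta W_i\Big)^{2}\right]
= \sum_i \mathbb{E}\!\left[\xi_i^{2}\,(\Delta W_i)^{2}\right]
+ 2\sum_{i<j} \mathbb{E}\!\left[\xi_i\,\xi_j\,\Delta W_i\,\Delta W_j\right]
= \sum_i \mathbb{E}\!\left[\xi_i^{2}\right](t_{i+1} - t_i)
= \mathbb{E}\!\left[\int_0^T X_s^{2}\,\mathrm{d}s\right]
$$

The cross terms vanish because, for $i < j$, the increment $\Delta W_j$ has mean zero and is independent of $\mathcal{F}_{t_j}$, with respect to which $\xi_i\,\xi_j\,\Delta W_i$ is measurable. The diagonal terms factorise because $\Delta W_i$ is independent of $\mathcal{F}_{t_i}$ and has variance $t_{i+1} - t_i$.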
Now suppose that is a Cauchy sequence of progressively measurable simple functions in then since the difference of two simple processes is again a simple process we can apply the Itô Isometry to deduce that
In other words, is also Cauchy in and since this is complete, we can conclude that
exists (in ). Uniqueness follows using the triangle inequality and the Itô isometry.
We defer proving the definition also makes sense almost surely to another blog post.
This approach seems fairly standard; see for example Handel (2007) and Mörters et al. (2010).
Rogers and Williams (2000) take a more general approach.
Protter (2004) takes a different approach by defining stochastic processes which are good integrators, a more abstract motivation than the one we give here.
The requirement of progressive measurability can be relaxed.
Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”
Mörters, P, Y Peres, O Schramm, and W Werner. 2010. Brownian motion. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=e-TbA-dSrzYC.
Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.
Rogers, L.C.G., and D. Williams. 2000. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus. Cambridge Mathematical Library. Cambridge University Press. https://books.google.co.uk/books?id=bDQy-zoHWfcC.
Let and be measures on with , a sub -algebra and an integrable random variable () then
Thus
Hence
We note that
is -measurable (it is the result of a projection) and that
Hence
as required.
If you look at the Wikipedia article on Hidden Markov Models (HMMs) then you might be forgiven for concluding that these deal only with discrete time and finite state spaces. In fact, HMMs are much more general. Furthermore, putting such models into context helps in understanding them. Before actually specifying what an HMM is, let us review something of Markov processes. A subsequent blog post will cover HMMs themselves.
Recall that a transition kernel is a mapping where and are two measurable spaces such that is a probability measure on for all and such that is a measurable function on for all .
For example, we could have and and . Hopefully this should remind you of the transition matrix of a Markov chain.
Recall further that a family of such transitions where is some index set satisfying
gives rise to a Markov process (under some mild conditions — see Rogers and Williams (2000) and Kallenberg (2002) for much more detail), that is, a process in which what happens next only depends on where the process is now and not how it got there.
Let us carry on with our example and take . With a slight abuse of notation and since is finite we can re-write the integral as a sum
which we recognise as a restatement of how Markov transition matrices combine.
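For a finite state space this composition rule is ordinary matrix multiplication. A minimal sketch follows; the two-state matrix `oneStep` is a made-up example, and `compose` is a hypothetical helper rather than anything defined in the text.

```haskell
import Data.List (transpose)

type Matrix = [[Double]]

-- | Compose two transition matrices: entry (i, k) sums, over the
-- intermediate state j, the product P(i -> j) * Q(j -> k).
compose :: Matrix -> Matrix -> Matrix
compose p q = [ [ sum (zipWith (*) row col) | col <- transpose q ] | row <- p ]

-- | A hypothetical two-state one-step transition matrix (rows sum to 1).
oneStep :: Matrix
oneStep = [ [0.9, 0.1]
          , [0.5, 0.5] ]

-- | The corresponding two-step transition matrix.
twoStep :: Matrix
twoStep = compose oneStep oneStep
```

Each row of `twoStep` again sums to 1, so the composed object is still a transition matrix.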
A deterministic system can be formulated as a Markov process with a particularly simple transition kernel given by
where is the deterministic state update function (the flow) and is the Dirac delta function.
Let us suppose that the deterministic system is dependent on some time-varying values for which we are unable or unwilling to specify a deterministic model. For example, we may be considering a predator-prey model where the parameters cannot explain every aspect. We could augment the deterministic kernel in the previous example with
where we use Greek letters for the parameters (and Roman letters for state) and we use e.g. to indicate probability densities. In other words, the parameters tend to wiggle around like Brown’s pollen particles rather than remaining absolutely fixed.
Of course Brownian motion or diffusion may not be a good model for our parameters; with Brownian motion, the parameters could drift off to $\pm\infty$. We might believe that our parameters tend to stay close to some given value (mean-reverting) and use the Ornstein-Uhlenbeck kernel.
where expresses how strongly we expect the parameter to respond to perturbations, is the mean to which the process wants to revert (aka the asymptotic mean) and expresses how noisy the process is.
It is sometimes easier to view these transition kernels in terms of stochastic differential equations. Brownian motion can be expressed as
and Ornstein-Uhlenbeck can be expressed as
where is the Wiener process.
Let us check that the latter stochastic differential equation gives the stated kernel. Re-writing it in integral form and without loss of generality taking
Since the integral is of a deterministic function, the distribution of is normal. Thus we need only calculate the mean and variance.
The mean is straightforward.
Without loss of generality assume and writing for covariance
And now we can use Itô and independence
Substituting in gives the desired result.
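The resulting transition kernel is easy to evaluate numerically. The sketch below assumes the usual parametrisation $\mathrm{d}X_t = \theta(\mu - X_t)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t$; the function name `ouMeanVar` and the argument names are illustrative and may not match the symbols used in the text.

```haskell
-- | Mean and variance of X_t given X_0 = x0 for the SDE
--   dX = theta * (mu - X) dt + sigma dW   (parametrisation assumed).
-- The transition kernel is the normal distribution with these two moments.
ouMeanVar :: Double -> Double -> Double -> Double -> Double -> (Double, Double)
ouMeanVar theta mu sigma t x0 = (mean, var)
  where
    -- The starting displacement from mu decays exponentially at rate theta.
    mean = mu + (x0 - mu) * exp (negate theta * t)
    -- The variance grows from 0 towards the stationary value sigma^2 / (2 * theta).
    var  = sigma ^ (2 :: Int) / (2 * theta) * (1 - exp (negate 2 * theta * t))
```

At `t = 0` this gives a point mass at `x0`, and for large `t` the moments approach the asymptotic mean and the stationary variance, which is the mean-reverting behaviour that distinguishes this kernel from plain Brownian motion.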
Kallenberg, O. 2002. Foundations of Modern Probability. Probability and Its Applications. Springer New York. http://books.google.co.uk/books?id=TBgFslMy8V4C.
Rogers, L. C. G., and David Williams. 2000. Diffusions, Markov Processes, and Martingales. Vol. 1. Cambridge Mathematical Library. Cambridge: Cambridge University Press.