Theorem
Let and be measures on with , a sub -algebra and an integrable random variable () then
Proof
Thus
Hence
We note that
is -measurable (it is the result of a projection) and that
Hence
as required.
Let and be measures on with , a sub -algebra and an integrable random variable () then
Thus
Hence
We note that
is -measurable (it is the result of a projection) and that
Hence
as required.
If you look at the wikipedia article on Hidden Markov Models (HMMs) then you might be forgiven for concluding that these deal only with discrete time and finite state spaces. In fact, HMMs are much more general. Furthermore, a better understanding of such models can be helped by putting them into context. Before actually specifying what an HMM is, let us review something of Markov processes. A subsequent blog post will cover HMMs themselves.
Recall that a transition kernel is a mapping where and are two measurable spaces such that is a probability measure on for all and such that is a measurable function on for all .
For example, we could have and and . Hopefully this should remind you of the transition matrix of a Markov chain.
Recall further that a family of such transitions where is some index set satisfying
gives rise to a Markov process (under some mild conditions — see Rogers and Williams (2000) and Kallenberg (2002) for much more detail), that is, a process in which what happens next only depends on where the process is now and not how it got there.
Let us carry on with our example and take . With a slight abuse of notation and since is finite we can re-write the integral as a sum
which we recognise as a restatement of how Markov transition matrices combine.
A deterministic system can be formulated as a Markov process with a particularly simple transition kernel given by
where is the deterministic state update function (the flow) and is the Dirac delta function.
Let us suppose that the determinstic system is dependent on some time-varying values for which we we are unable or unwish to specify a deterministic model. For example, we may be considering predator-prey model where the parameters cannot explain every aspect. We could augment the deterministic kernel in the previous example with
where we use Greek letters for the parameters (and Roman letters for state) and we use e.g. to indicate probability densities. In other words that the parameters tend to wiggle around like Brown’s pollen particles rather than remaining absolutely fixed.
Of course Brownian motion or diffusion may not be a good model for our parameters; with Brownian motion, the parameters could drift off to . We might believe that our parameters tend to stay close to some given value (mean-reverting) and use the Ornstein-Uhlenbeck kernel.
where expresses how strongly we expect the parameter to respond to perturbations, is the mean to which the process wants to revert (aka the asymptotic mean) and expresses how noisy the process is.
It is sometimes easier to view these transition kernels in terms of stochastic differential equations. Brownian motion can be expressed as
and Ornstein-Uhlenbeck can be expressed as
where is the Wiener process.
Let us check that the latter stochastic differential equation gives the stated kernel. Re-writing it in integral form and without loss of generality taking
Since the integral is of a deterministic function, the distribution of is normal. Thus we need only calculate the mean and variance.
The mean is straightforward.
Without loss of generality assume and writing for covariance
And now we can use Ito and independence
Substituting in gives the desired result.
Kallenberg, O. 2002. Foundations of Modern Probability. Probability and Its Applications. Springer New York. http://books.google.co.uk/books?id=TBgFslMy8V4C.
Rogers, L. C. G., and David Williams. 2000. Diffusions, Markov Processes, and Martingales. Vol. 1. Cambridge Mathematical Library. Cambridge: Cambridge University Press.
Here’s the same analysis of estimating population growth using Stan.
data {
int<lower=0> N; // number of observations
vector[N] y; // observed population
}
parameters {
real r;
}
model {
real k;
real p0;
real deltaT;
real sigma;
real mu0;
real sigma0;
vector[N] p;
k <- 1.0;
p0 <- 0.1;
deltaT <- 0.0005;
sigma <- 0.01;
mu0 <- 5;
sigma0 <- 10;
r ~ normal(mu0, sigma0);
for (n in 1:N) {
p[n] <- k * p0 * exp((n - 1) * r * deltaT) / (k + p0 * (exp((n - 1) * r * deltaT) - 1));
y[n] ~ normal(p[n], sigma);
}
}
Empirically, by looking at the posterior, this seems to do a better job than either extended Kalman or vanilla Metropolis.
Let us see if we can estimate the parameter for population growth using MCMC in the example in which we used Kalman filtering.
We recall the model.
And we are allowed to sample at regular intervals
In other words , where is known so the likelihood is
Let us assume a prior of then the posterior becomes
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}
> module PopGrowthMCMC where
>
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
We assume most of the parameters are known with the exception of the the growth rate . We fix this also in order to generate test data.
> k, p0 :: Double
> k = 1.0
> p0 = 0.1
> r, deltaT :: Double
> r = 10.0
> deltaT = 0.0005
> nObs :: Int
> nObs = 150
Here’s the implementation of the logistic function
> logit :: Double -> Double -> Double -> Double
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))
Let us create some noisy data.
> sigma :: Double
> sigma = 1e-2
> samples :: [Double]
> samples = zipWith (+) mus epsilons
> where
> mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
> epsilons = evalState (sample $ replicateM nObs $ rvar (Normal 0.0 sigma)) (pureMT 3)
Arbitrarily let us set the prior to have a rather vague normal distribution.
> mu0, sigma0 :: Double
> mu0 = 5.0
> sigma0 = 1e1
> prior :: Double -> Double
> prior r = exp (-(r - mu0)**2 / (2 * sigma0**2))
> likelihood :: Double -> [Double] -> Double
> likelihood r ys = exp (-sum (zipWith (\y mu -> (y - mu)**2 / (2 * sigma**2)) ys mus))
> where
> mus :: [Double]
> mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
> posterior :: Double -> [Double] -> Double
> posterior r ys = likelihood r ys * prior r
The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.
> acceptanceProb' :: Double -> Double -> [Double] -> Double
> acceptanceProb' r r' ys = min 1.0 ((posterior r' ys) / (posterior r ys))
> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (r, nAccs) (proposedJump, acceptOrReject) =
> if acceptOrReject < acceptanceProb' r (r + proposedJump) samples
> then (r + proposedJump, nAccs + 1)
> else (r, nAccs)
Here are our proposals.
> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
> evalState (replicateM nIters (sample (Normal 0.0 sigma)))
> (pureMT $ fromIntegral seed)
We also need samples from the uniform distribution
> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
> evalState (replicateM nIters (sample stdUniform))
> (pureMT $ fromIntegral seed)
Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning.
> nIters, burnIn :: Int
> nIters = 100000
> burnIn = nIters `div` 10
Let us start our chain to the mean of the prior. In theory this shoudn’t matter as by the time we have burnt in we should be sampling in the high density region of the distribution.
> startMu :: Double
> startMu = 5.0
This jump size should allow us to sample the region of high density at a reasonable granularity.
> jumpVar :: Double
> jumpVar = 0.01
Now we can test our MCMC implementation.
> test :: Int -> [(Double, Int)]
> test seed =
> drop burnIn $
> scanl oneStep (startMu, 0) $
> zip (normalisedProposals seed jumpVar nIters)
> (acceptOrRejects seed nIters)
We put the data into a histogram.
> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
> where
> lower = r - 0.5 * sigma0
> upper = r + 0.5 * sigma0
>
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $ test seed)
With 50 observations we don’t seem to be very certain about the growth rate.
With 100 observations we do very much better.
And with 150 observations we do even better.
It is roughly 150 miles from the source of the Thames to Kingston Bridge. If we assume that it flows at about 2 miles per hour then the water at Thames Head will have reached Kingston very roughly at days.
The Environmental Agency measure the flux at Kingston Bridge on a twice daily basis. Can we predict this? In the first instance without any other data and using our observation that Thames flushes itself every 3 days, let us try
where is the flux on day and are independent normal errors with mean and variance some given value .
As it stands, our model is not Markov so we cannot directly apply techniques such as Kalman filtering or particle filtering to estimate the parameters. However we can re-write the model as
Note that the observation map now varies over time so we have modify our Kalman filter implementation to accept a different matrix at each step.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE TypeFamilies #-}
> module Autoregression (
> predictions
> ) where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import qualified Data.Vector as V
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $ linSolve m eye
> outer :: forall m n . (KnownNat m, KnownNat n,
> (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
> R n -> Sq n -> [L m n] -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer muPrior sigmaPrior bigHs bigSigmaY bigA bigSigmaX ys = result
> where
> result = scanl update (muPrior, sigmaPrior) (zip ys bigHs)
>
> update :: (R n, Sq n) -> (R m, L m n) -> (R n, Sq n)
> update (xHatFlat, bigSigmaHatFlat) (y, bigH) =
> (xHatFlatNew, bigSigmaHatFlatNew)
> where
> v :: R m
> v = y - bigH #> xHatFlat
> bigS :: Sq m
> bigS = bigH bigSigmaHatFlat (tr bigH) + bigSigmaY
> bigK :: L n m
> bigK = bigSigmaHatFlat (tr bigH) (inv bigS)
> xHat :: R n
> xHat = xHatFlat + bigK #> v
> bigSigmaHat :: Sq n
> bigSigmaHat = bigSigmaHatFlat - bigK bigS (tr bigK)
> xHatFlatNew :: R n
> xHatFlatNew = bigA #> xHat
> bigSigmaHatFlatNew :: Sq n
> bigSigmaHatFlatNew = bigA bigSigmaHat (tr bigA) + bigSigmaX
We can now set up the parameters to run the filter.
> stateVariance :: Double
> stateVariance = 1e-8
> bigSigmaX :: Sq 3
> bigSigmaX = fromList [ stateVariance, 0.0, 0.0
> , 0.0, stateVariance, 0.0
> , 0.0, 0.0, stateVariance
> ]
> bigA :: Sq 3
> bigA = eye
> muPrior :: R 3
> muPrior = fromList [0.0, 0.0, 0.0]
> sigmaPrior :: Sq 3
> sigmaPrior = fromList [ 1e1, 0.0, 0.0
> , 0.0, 1e1, 0.0
> , 0.0, 0.0, 1e1
> ]
> bigHsBuilder :: V.Vector Double -> [L 1 3]
> bigHsBuilder flows =
> V.toList $
> V.zipWith3 (\x0 x1 x2 -> fromList [x0, x1, x2])
> (V.tail flows) (V.tail $ V.tail flows) (V.tail $ V.tail $ V.tail flows)
> obsVariance :: Double
> obsVariance = 1.0e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [ obsVariance ]
> predict :: R 3 -> Double -> Double -> Double -> Double
> predict theta f1 f2 f3 = h1 * f1 + h2 * f2 + h3 * f3
> where
> (h1, t1) = headTail theta
> (h2, t2) = headTail t1
> (h3, _) = headTail t2
> thetas :: V.Vector Double -> [(R 3, Sq 3)]
> thetas flows = outer muPrior sigmaPrior (bigHsBuilder flows)
> bigSigmaY bigA bigSigmaX (map (fromList . return) (V.toList flows))
> predictions :: V.Vector Double -> V.Vector Double
> predictions flows =
> V.zipWith4 predict
> (V.fromList $ map fst (thetas flows))
> flows (V.tail flows) (V.tail $ V.tail flows)
If we assume that parameters are essentially fixed by taking the state variance to be e.g. then the fit is not good.
However, if we assume the parameters to undergo Brownian motion by taking the state variance to be e.g. then we get a much better fit. Of course, Brownian motion is probably not a good way of modelling the parameters; we hardly expect that these could wander off to infinity.
Suppose you want to sample from the truncated normal distribution. One way to do this is to use rejection sampling. But if you do this naïvely then you will run into performance problems. The excellent Devroye (1986) who references Marsaglia (1964) gives an efficient rejection sampling scheme using the Rayleigh distribution. The random-fu package uses the Exponential distribution.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE FlexibleContexts #-}
> import Control.Monad
> import Data.Random
> import qualified Data.Random.Distribution.Normal as N
> import Data.Random.Source.PureMT
> import Control.Monad.State
Here’s the naïve implementation.
> naiveReject :: Double -> RVar Double
> naiveReject x = doit
> where
> doit = do
> y <- N.stdNormal
> if y < x
> then doit
> else return y
And here’s an implementation using random-fu.
> expReject :: Double -> RVar Double
> expReject x = N.normalTail x
Let’s try running both of them
> n :: Int
> n = 10000000
> lower :: Double
> lower = 2.0
> testExp :: [Double]
> testExp = evalState (replicateM n $ sample (expReject lower)) (pureMT 3)
> testNaive :: [Double]
> testNaive = evalState (replicateM n $ sample (naiveReject lower)) (pureMT 3)
> main :: IO ()
> main = do
> print $ sum testExp
> print $ sum testNaive
Let’s try building and running both the naïve and better tuned versions.
ghc -O2 CompareRejects.hs
As we can see below we get 59.98s and 4.28s, a compelling reason to use the tuned version. And the difference in performance will get worse the less of the tail we wish to sample from.
2.3731610476911187e7
11,934,195,432 bytes allocated in the heap
5,257,328 bytes copied during GC
44,312 bytes maximum residency (2 sample(s))
21,224 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 23145 colls, 0 par 0.09s 0.11s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 4.19s ( 4.26s elapsed)
GC time 0.09s ( 0.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.28s ( 4.37s elapsed)
%GC time 2.2% (2.6% elapsed)
Alloc rate 2,851,397,967 bytes per MUT second
Productivity 97.8% of total user, 95.7% of total elapsed
2.3732073159369867e7
260,450,762,656 bytes allocated in the heap
111,891,960 bytes copied during GC
85,536 bytes maximum residency (2 sample(s))
76,112 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 512768 colls, 0 par 1.86s 2.24s 0.0000s 0.0008s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 58.12s ( 58.99s elapsed)
GC time 1.86s ( 2.24s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 59.98s ( 61.23s elapsed)
%GC time 3.1% (3.7% elapsed)
Alloc rate 4,481,408,869 bytes per MUT second
Productivity 96.9% of total user, 94.9% of total elapsed
Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag. http://books.google.co.uk/books?id=mEw\_AQAAIAAJ.
Marsaglia, G. 1964. “Generating a Variable from the Tail of the Normal Distribution.” J-Technometrics 6 (1): 101–2.
Simple models for e.g. financial option pricing assume that the volatility of an index or a stock is constant, see here for example. However, simple observation of time series show that this is not the case; if it were then the log returns would be white noise
One approach which addresses this, GARCH (Generalised AutoRegressive Conditional Heteroskedasticity), models the evolution of volatility deterministically.
Stochastic volatility models treat the volatility of a return on an asset, such as an option to buy a security, as a Hidden Markov Model (HMM). Typically, the observable data consist of noisy mean-corrected returns on an underlying asset at equally spaced time points.
There is evidence that Stochastic Volatility models (Kim, Shephard, and Chib (1998)) offer increased flexibility over the GARCH family, e.g. see Geweke (1994), Fridman and Harris (1998) and Jacquier, Polson, and Rossi (1994). Despite this and judging by the numbers of questions on the R Special Interest Group on Finance mailing list, the use of GARCH in practice far outweighs that of Stochastic Volatility. Reasons cited are the multiplicity of estimation methods for the latter and the lack of packages (but see here for a recent improvement to the paucity of packages).
In their tutorial on particle filtering, Doucet and Johansen (2011) give an example of stochastic volatility. We save this approach for future blog posts and follow Lopes and Polson and the excellent lecture notes by Hedibert Lopes.
Here’s the model.
We wish to estimate and . To do this via a Gibbs sampler we need to sample from
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE RecursiveDo #-}
> {-# LANGUAGE ExplicitForAll #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE FlexibleContexts #-}
> module StochVol (
> bigM
> , bigM0
> , runMC
> , ys
> , vols
> , expectationTau2
> , varianceTau2
> ) where
> import Numeric.LinearAlgebra.HMatrix hiding ( (===), (|||), Element,
> (<>), (#>), inv )
> import qualified Numeric.LinearAlgebra.Static as S
> import Numeric.LinearAlgebra.Static ( (<>) )
> import GHC.TypeLits
> import Data.Proxy
> import Data.Maybe ( fromJust )
> import Data.Random
> import Data.Random.Source.PureMT
> import Control.Monad.Fix
> import Control.Monad.State.Lazy
> import Control.Monad.Writer hiding ( (<>) )
> import Control.Monad.Loops
> import Control.Applicative
> import qualified Data.Vector as V
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => S.Sq n -> S.Sq n
> inv m = fromJust $ S.linSolve m S.eye
> infixr 8 #>
> (#>) :: (KnownNat m, KnownNat n) => S.L m n -> S.R n -> S.R m
> (#>) = (S.#>)
> type StatsM a = RVarT (Writer [((Double, Double), Double)]) a
> (|||) :: (KnownNat ((+) r1 r2), KnownNat r2, KnownNat c, KnownNat r1) =>
> S.L c r1 -> S.L c r2 -> S.L c ((+) r1 r2)
> (|||) = (S.¦)
Let us take a prior that is standard for linear regression
where and use standard results for linear regression to obtain the required marginal distribution.
That the prior is Normal Inverse Gamma () means
Standard Bayesian analysis for regression tells us that the (conditional) posterior distribution for
where the are IID normal with variance is given by
with
We can re-write the above recursively. We do not need to for this blog article but it will be required in any future blog article which uses Sequential Monte Carlo techniques.
Furthermore
so we can write
and
In the case of our model we can specialise the non-recursive equations as
Let’s re-write the notation to fit our model.
Sample from
We can implement this in Haskell as
> sampleParms ::
> forall n m .
> (KnownNat n, (1 <=? n) ~ 'True) =>
> S.R n -> S.L n 2 -> S.R 2 -> S.Sq 2 -> Double -> Double ->
> RVarT m (S.R 2, Double)
> sampleParms y bigX theta_0 bigLambda_0 a_0 s_02 = do
> let n = natVal (Proxy :: Proxy n)
> a_n = 0.5 * (a_0 + fromIntegral n)
> bigLambda_n = bigLambda_0 + (tr bigX) <> bigX
> invBigLambda_n = inv bigLambda_n
> theta_n = invBigLambda_n #> ((tr bigX) #> y + (tr bigLambda_0) #> theta_0)
> b_0 = 0.5 * a_0 * s_02
> b_n = b_0 +
> 0.5 * (S.extract (S.row y <> S.col y)!0!0) +
> 0.5 * (S.extract (S.row theta_0 <> bigLambda_0 <> S.col theta_0)!0!0) -
> 0.5 * (S.extract (S.row theta_n <> bigLambda_n <> S.col theta_n)!0!0)
> g <- rvarT (Gamma a_n (recip b_n))
> let s2 = recip g
> invBigLambda_n' = m <> invBigLambda_n
> where
> m = S.diag $ S.vector (replicate 2 s2)
> m1 <- rvarT StdNormal
> m2 <- rvarT StdNormal
> let theta_n' :: S.R 2
> theta_n' = theta_n + S.chol (S.sym invBigLambda_n') #> (S.vector [m1, m2])
> return (theta_n', s2)
Using a standard result about conjugate priors and since we have
we can deduce
where
> sampleH0 :: Double ->
> Double ->
> V.Vector Double ->
> Double ->
> Double ->
> Double ->
> RVarT m Double
> sampleH0 iC0 iC0m0 hs mu phi tau2 = do
> let var = recip $ (iC0 + phi^2 / tau2)
> mean = var * (iC0m0 + phi * ((hs V.! 0) - mu) / tau2)
> rvarT (Normal mean (sqrt var))
From the state equation, we have
We also have
Adding the two expressions together gives
Since are standard normal, then conditional on and is normally distributed, and
We also have
Writing
by Bayes’ Theorem we have
where is the probability density function of a normal distribution.
We can sample from this using Metropolis
For each , sample from where is the tuning variance.
For each , compute the acceptance probability
> metropolis :: V.Vector Double ->
> Double ->
> Double ->
> Double ->
> Double ->
> V.Vector Double ->
> Double ->
> RVarT m (V.Vector Double)
> metropolis ys mu phi tau2 h0 hs vh = do
> let eta2s = V.replicate (n-1) (tau2 / (1 + phi^2)) `V.snoc` tau2
> etas = V.map sqrt eta2s
> coef1 = (1 - phi) / (1 + phi^2) * mu
> coef2 = phi / (1 + phi^2)
> mu_n = mu + phi * (hs V.! (n-1))
> mu_1 = coef1 + coef2 * ((hs V.! 1) + h0)
> innerMus = V.zipWith (\hp1 hm1 -> coef1 + coef2 * (hp1 + hm1)) (V.tail (V.tail hs)) hs
> mus = mu_1 `V.cons` innerMus `V.snoc` mu_n
> hs' <- V.mapM (\mu -> rvarT (Normal mu vh)) hs
> let num1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs'
> num2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs'
> nums = V.zipWith (+) num1s num2s
> den1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs
> den2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs
> dens = V.zipWith (+) den1s den2s
> us <- V.replicate n <$> rvarT StdUniform
> let ls = V.zipWith (\n d -> min 0.0 (n - d)) nums dens
> return $ V.zipWith4 (\u l h h' -> if log u < l then h' else h) us ls hs hs'
Now we can write down a single step for our Gibbs sampler, sampling from each marginal in turn.
> singleStep :: Double -> V.Vector Double ->
> (Double, Double, Double, Double, V.Vector Double) ->
> StatsM (Double, Double, Double, Double, V.Vector Double)
> singleStep vh y (mu, phi, tau2, h0, h) = do
> lift $ tell [((mu, phi),tau2)]
> hNew <- metropolis y mu phi tau2 h0 h vh
> h0New <- sampleH0 iC0 iC0m0 hNew mu phi tau2
> let bigX' = (S.col $ S.vector $ replicate n 1.0)
> |||
> (S.col $ S.vector $ V.toList $ h0New `V.cons` V.init hNew)
> bigX = bigX' `asTypeOf` (snd $ valAndType nT)
> newParms <- sampleParms (S.vector $ V.toList h) bigX (S.vector [mu0, phi0]) invBigV0 nu0 s02
> return ( (S.extract (fst newParms))!0
> , (S.extract (fst newParms))!1
> , snd newParms
> , h0New
> , hNew
> )
Let’s create some test data.
> mu', phi', tau2', tau' :: Double
> mu' = -0.00645
> phi' = 0.99
> tau2' = 0.15^2
> tau' = sqrt tau2'
We need to create a statically typed matrix with one dimension the same size as the data so we tie the data size value to the required type.
> nT :: Proxy 500
> nT = Proxy
> valAndType :: KnownNat n => Proxy n -> (Int, S.L n 2)
> valAndType x = (fromIntegral $ natVal x, undefined)
> n :: Int
> n = fst $ valAndType nT
Arbitrarily let us start the process at
> h0 :: Double
> h0 = 0.0
We define the process as a stream (aka co-recursively) using the Haskell recursive do construct. It is not necessary to do this but streams are a natural way to think of stochastic processes.
> hs, vols, sds, ys :: V.Vector Double
> hs = V.fromList $ take n $ fst $ runState hsAux (pureMT 1)
> where
> hsAux :: (MonadFix m, MonadRandom m) => m [Double]
> hsAux = mdo { x0 <- sample (Normal (mu' + phi' * h0) tau')
> ; xs <- mapM (\x -> sample (Normal (mu' + phi' * x) tau')) (x0:xs)
> ; return xs
> }
> vols = V.map exp hs
We can plot the volatility (which we cannot observe directly).
And we can plot the log returns.
> sds = V.map sqrt vols
> ys = fst $ runState ysAux (pureMT 2)
> where
> ysAux = V.mapM (\sd -> sample (Normal 0.0 sd)) sds
We start with a vague prior for
> m0, c0 :: Double
> m0 = 0.0
> c0 = 100.0
For convenience
> iC0, iC0m0 :: Double
> iC0 = recip c0
> iC0m0 = iC0 * m0
Rather than really sample from priors for and let us cheat and assume we sampled the simulated values!
> mu0, phi0, tau20 :: Double
> mu0 = -0.00645
> phi0 = 0.99
> tau20 = 0.15^2
But that we are still very uncertain about them
> bigV0, invBigV0 :: S.Sq 2
> bigV0 = S.diag $ S.fromList [100.0, 100.0]
> invBigV0 = inv bigV0
> nu0, s02 :: Double
> nu0 = 10.0
> s02 = (nu0 - 2) / nu0 * tau20
Note that for the inverse gamma this gives
> expectationTau2, varianceTau2 :: Double
> expectationTau2 = (nu0 * s02 / 2) / ((nu0 / 2) - 1)
> varianceTau2 = (nu0 * s02 / 2)^2 / (((nu0 / 2) - 1)^2 * ((nu0 / 2) - 2))
ghci> expectationTau2
2.25e-2
ghci> varianceTau2
1.6874999999999998e-4
Tuning parameter
> vh :: Double
> vh = 0.1
The burn-in and sample sizes may be too low for actual estimation but will suffice for a demonstration.
> bigM, bigM0 :: Int
> bigM0 = 2000
> bigM = 2000
> multiStep :: StatsM (Double, Double, Double, Double, V.Vector Double)
> multiStep = iterateM_ (singleStep vh ys) (mu0, phi0, tau20, h0, vols)
> runMC :: [((Double, Double), Double)]
> runMC = take bigM $ drop bigM0 $
> execWriter (evalStateT (sample multiStep) (pureMT 42))
And now we can look at the distributions of our estimates
Doucet, Arnaud, and Adam M Johansen. 2011. “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.” In Handbook of Nonlinear Filtering. Oxford, UK: Oxford University Press.
Fridman, Moshe, and Lawrence Harris. 1998. “A Maximum Likelihood Approach for Non-Gaussian Stochastic Volatility Models.” Journal of Business & Economic Statistics 16 (3): 284–91.
Geweke, John. 1994. “Bayesian Comparison of Econometric Models.”
Jacquier, Eric, Nicholas G. Polson, and Peter E. Rossi. 1994. “Bayesian Analysis of Stochastic Volatility Models.”
Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65 (3): 361–93. http://ideas.repec.org/a/bla/restud/v65y1998i3p361-93.html.
Conor McBride was not joking when he and his co-author entitled their paper about dependent typing in Haskell “Hasochism”: Lindley and McBride (2013).
In trying to resurrect the Haskell package yarr, it seemed that a dependently typed reverse function needed to be written. Writing such a function turns out to be far from straightforward. How GHC determines that a proof (program) discharges a proposition (type signature) is rather opaque and perhaps not surprisingly the error messages one gets if the proof is incorrect are far from easy to interpret.
I’d like to thank all the folk on StackOverflow whose answers and comments I have used freely below. Needless to say, any errors are entirely mine.
Here are two implementations, each starting from different axioms (NB: I have never seen type families referred to as axioms but it seems helpful to think about them in this way).
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE GADTs #-}
> {-# LANGUAGE KindSignatures #-}
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE UndecidableInstances #-}
> {-# LANGUAGE ExplicitForAll #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE ScopedTypeVariables #-}
For both implementations, we need propositional equality: if a :~: b
is inhabited by some terminating value, then the type a
is the same as the type b
. Further we need the normal form of an equality proof: Refl :: a :~: a
and a function, gcastWith
which allows us to use internal equality (:~:)
to discharge a required proof of external equality (~)
. Readers familiar with topos theory, for example see Lambek and Scott (1988), will note that the notation is reversed.
> import Data.Type.Equality ( (:~:) (Refl), gcastWith )
For the second of the two approaches adumbrated we will need
> import Data.Proxy
The usual natural numbers:
> data Nat = Z | S Nat
We need some axioms:
> type family (n :: Nat) :+ (m :: Nat) :: Nat where
> Z :+ m = m
> S n :+ m = n :+ S m
We need the usual singleton for Nat
to tie types and terms together.
> data SNat :: Nat -> * where
> SZero :: SNat Z
> SSucc :: SNat n -> SNat (S n)
Now we can prove some lemmas.
First a lemma showing we can push :+
inside a successor, S
.
> succ_plus_id :: SNat n1 -> SNat n2 -> (((S n1) :+ n2) :~: (S (n1 :+ n2)))
> succ_plus_id SZero _ = Refl
> succ_plus_id (SSucc n) m = gcastWith (succ_plus_id n (SSucc m)) Refl
This looks nothing like a standard mathematical proof and it’s hard to know what ghc is doing under the covers but presumably something like this:
SZero
S Z :+ n2 = Z :+ S n2
(by axiom 2) = S n2
(by axiom 1) andS (Z + n2) = S n2
(by axiom 1)S Z :+ n2 = S (Z + n2)
SSucc
SSucc n :: SNat (S k)
so n :: SNat k
and m :: SNat i
so SSucc m :: SNat (S i)
succ_plus id n (SSucc m) :: k ~ S p => S p :+ S i :~: S (p :+ S i)
(by hypothesis)k ~ S p => S p :+ S i :~: S (S p :+ i)
(by axiom 2)k :+ S i :~: S (k :+ i)
(by substitution)S k :+ i :~: S (k :+ i)
(by axiom 2)Second a lemma showing that Z
is also the right unit.
> plus_id_r :: SNat n -> ((n :+ Z) :~: n)
> plus_id_r SZero = Refl
> plus_id_r (SSucc n) = gcastWith (plus_id_r n) (succ_plus_id n SZero)
SZero
Z :+ Z = Z
(by axiom 1)SSucc
SSucc n :: SNat (S k)
so n :: SNat k
plus_id_r n :: k :+ Z :~: k
(by hypothesis)succ_plus_id n SZero :: S k :+ Z :~: S (k + Z)
(by the succ_plus_id
lemma)succ_plus_id n SZero :: k :+ Z ~ k => S k :+ Z :~: S k
(by substitution)plus_id_r n :: k :+ Z :~: k
(by equation 2)Now we can defined vectors which have their lengths encoded in their type.
> infixr 4 :::
> data Vec a n where
> Nil :: Vec a Z
> (:::) :: a -> Vec a n -> Vec a (S n)
We can prove a simple result using the lemma that Z
is a right unit.
> elim0 :: SNat n -> (Vec a (n :+ Z) -> Vec a n)
> elim0 n x = gcastWith (plus_id_r n) x
Armed with this we can write an reverse function in which the length of the result is guaranteed to be the same as the length of the argument.
> size :: Vec a n -> SNat n
> size Nil = SZero
> size (_ ::: xs) = SSucc $ size xs
> accrev :: Vec a n -> Vec a n
> accrev x = elim0 (size x) $ go Nil x where
> go :: Vec a m -> Vec a n -> Vec a (n :+ m)
> go acc Nil = acc
> go acc (x ::: xs) = go (x ::: acc) xs
> toList :: Vec a n -> [a]
> toList Nil = []
> toList (x ::: xs) = x : toList xs
> test0 :: [String]
> test0 = toList $ accrev $ "a" ::: "b" ::: "c" ::: Nil
ghci> test0
["c","b","a"]
For an alternative approach, let us change the axioms slightly.
> type family (n1 :: Nat) + (n2 :: Nat) :: Nat where
> Z + n2 = n2
> (S n1) + n2 = S (n1 + n2)
Now the proof that Z
is a right unit is more straightforward.
> plus_id_r1 :: SNat n -> ((n + Z) :~: n)
> plus_id_r1 SZero = Refl
> plus_id_r1 (SSucc n) = gcastWith (plus_id_r1 n) Refl
For the lemma showing we can push +
inside a successor, S
, we can use a Proxy
.
> plus_succ_r1 :: SNat n1 -> Proxy n2 -> ((n1 + (S n2)) :~: (S (n1 + n2)))
> plus_succ_r1 SZero _ = Refl
> plus_succ_r1 (SSucc n1) proxy_n2 = gcastWith (plus_succ_r1 n1 proxy_n2) Refl
Now we can write our reverse function without having to calculate sizes.
> accrev1 :: Vec a n -> Vec a n
> accrev1 Nil = Nil
> accrev1 list = go SZero Nil list
> where
> go :: SNat n1 -> Vec a n1 -> Vec a n2 -> Vec a (n1 + n2)
> go snat acc Nil = gcastWith (plus_id_r1 snat) acc
> go snat acc (h ::: (t :: Vec a n3)) =
> gcastWith (plus_succ_r1 snat (Proxy :: Proxy n3))
> (go (SSucc snat) (h ::: acc) t)
> test1 :: [String]
> test1 = toList $ accrev1 $ "a" ::: "b" ::: "c" ::: Nil
ghci> test0
["c","b","a"]
Lambek, J., and P.J. Scott. 1988. Introduction to Higher-Order Categorical Logic. Cambridge Studies in Advanced Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=6PY_emBeGjUC.
Lindley, Sam, and Conor McBride. 2013. “Hasochism: The Pleasure and Pain of Dependently Typed Haskell Programming.” In Proceedings of the 2013 ACM SIGPLAN Symposium on Haskell, 81–92. Haskell ’13. New York, NY, USA: ACM. doi:10.1145/2503778.2503786.
Let an be stopping times and let the filtration on which they are defined be right continuous. Then
are stopping times where .
For the first we have and both the latter are in by the definition of a stopping time.
Similarly for the second .
For the fourth we have since .
The third is slightly trickier. For , if and only if for some rational , we have . We can thus we can find such that . Writing we also have . Thus we have if and only if there exist and such that and and . In other words
By right continuity (Protter 2004 Theorem 1) of the filtration, we know the terms on the right hand side are in and so that the whole right hand side is in . We thus know that the left hand side is in and using right continuity again that therefore must be a stopping time.
Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.
An extended Kalman filter in Haskell using type level literals and automatic differentiation to provide some guarantees of correctness.
Suppose we wish to model population growth of bees via the logistic equation
We assume the growth rate is unknown and drawn from a normal distribution but the carrying capacity is known and we wish to estimate the growth rate by observing noisy values of the population at discrete times . Note that is entirely deterministic and its stochasticity is only as a result of the fact that the unknown parameter of the logistic equation is sampled from a normal distribution (we could for example be observing different colonies of bees and we know from the literature that bee populations obey the logistic equation and each colony will have different growth rates).
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE BangPatterns #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE TypeFamilies #-}
> module FunWithKalman3 where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import Numeric.AD
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import Control.Monad.Loops
The logistic equation is a well known example of a dynamical system which has an analytic solution
Here it is in Haskell
> logit :: Floating a => a -> a -> a -> a
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))
We observe a noisy value of population at regular time intervals (where is the time interval)
Using the semi-group property of our dynamical system, we can re-write this as
To convince yourself that this re-formulation is correct, think of the population as starting at ; after 1 time step it has reached and and after two time steps it has reached and this ought to be the same as the point reached after 1 time step starting at , for example
> oneStepFrom0, twoStepsFrom0, oneStepFrom1 :: Double
> oneStepFrom0 = logit 0.1 1.0 (1 * 0.1)
> twoStepsFrom0 = logit 0.1 1.0 (1 * 0.2)
> oneStepFrom1 = logit oneStepFrom0 1.0 (1 * 0.1)
ghci> twoStepsFrom0
0.11949463171139338
ghci> oneStepFrom1
0.1194946317113934
We would like to infer the growth rate not just be able to predict the population so we need to add another variable to our model.
This is almost in the form suitable for estimation using a Kalman filter but the dependency of the state on the previous state is non-linear. We can modify the Kalman filter to create the extended Kalman filter (EKF) by making a linear approximation.
Since the measurement update is trivially linear (even in this more general form), the measurement update step remains unchanged.
By Taylor we have
where is the Jacobian of evaluated at (for the exposition of the extended filter we take to be vector valued hence the use of a bold font). We take to be normally distributed with mean of 0 and ignore any difficulties there may be with using Taylor with stochastic variables.
Applying this at we have
Using the same reasoning as we did as for Kalman filters and writing for we obtain
Note that we pass in the Jacobian of the update function as a function itself in the case of the extended Kalman filter rather than the matrix representing the linear function as we do in the case of the classical Kalman filter.
> k, p0 :: Floating a => a
> k = 1.0
> p0 = 0.1 * k
> r, deltaT :: Floating a => a
> r = 10.0
> deltaT = 0.0005
Relating ad and hmatrix is somewhat unpleasant but this can probably be ameliorated by defining a suitable datatype.
> a :: R 2 -> R 2
> a rpPrev = rNew # pNew
> where
> (r, pPrev) = headTail rpPrev
> rNew :: R 1
> rNew = konst r
>
> (p, _) = headTail pPrev
> pNew :: R 1
> pNew = fromList $ [logit p k (r * deltaT)]
> bigA :: R 2 -> Sq 2
> bigA rp = fromList $ concat $ j [r, p]
> where
> (r, ps) = headTail rp
> (p, _) = headTail ps
> j = jacobian (\[r, p] -> [r, logit p k (r * deltaT)])
For some reason, hmatrix with static guarantees does not yet provide an inverse function for matrices.
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $ linSolve m eye
Here is the extended Kalman filter itself. The type signatures on the expressions inside the function are not necessary but did help the implementor discover a bug in the mathematical derivation and will hopefully help the reader.
> outer :: forall m n . (KnownNat m, KnownNat n,
> (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
> R n -> Sq n ->
> L m n -> Sq m ->
> (R n -> R n) -> (R n -> Sq n) -> Sq n ->
> [R m] ->
> [(R n, Sq n)]
> outer muPrior sigmaPrior bigH bigSigmaY
> littleA bigABuilder bigSigmaX ys = result
> where
> result = scanl update (muPrior, sigmaPrior) ys
>
> update :: (R n, Sq n) -> R m -> (R n, Sq n)
> update (xHatFlat, bigSigmaHatFlat) y =
> (xHatFlatNew, bigSigmaHatFlatNew)
> where
>
> v :: R m
> v = y - (bigH #> xHatFlat)
>
> bigS :: Sq m
> bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>
> bigK :: L n m
> bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>
> xHat :: R n
> xHat = xHatFlat + bigK #> v
>
> bigSigmaHat :: Sq n
> bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>
> bigA :: Sq n
> bigA = bigABuilder xHat
>
> xHatFlatNew :: R n
> xHatFlatNew = littleA xHat
>
> bigSigmaHatFlatNew :: Sq n
> bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX
Now let us create some sample data.
> obsVariance :: Double
> obsVariance = 1e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [obsVariance]
> nObs :: Int
> nObs = 300
> singleSample :: Double -> RVarT (W.Writer [Double]) Double
> singleSample p0 = do
> epsilon <- rvarT (Normal 0.0 obsVariance)
> let p1 = logit p0 k (r * deltaT)
> lift $ W.tell [p1 + epsilon]
> return p1
> streamSample :: RVarT (W.Writer [Double]) Double
> streamSample = iterateM_ singleSample p0
> samples :: [Double]
> samples = take nObs $ snd $
> W.runWriter (evalStateT (sample streamSample) (pureMT 3))
We created our data with a growth rate of
ghci> r
10.0
but let us pretend that we have read the literature on growth rates of bee colonies and we have some big doubts about growth rates but we are almost certain about the size of the colony at .
> muPrior :: R 2
> muPrior = fromList [5.0, 0.1]
>
> sigmaPrior :: Sq 2
> sigmaPrior = fromList [ 1e2, 0.0
> , 0.0, 1e-10
> ]
We only observe the population and not the rate itself.
> bigH :: L 1 2
> bigH = fromList [0.0, 1.0]
Strictly speaking this should be 0 but this is close enough.
> bigSigmaX :: Sq 2
> bigSigmaX = fromList [ 1e-10, 0.0
> , 0.0, 1e-10
> ]
Now we can run our filter and watch it switch away from our prior belief as it accumulates more and more evidence.
> test :: [(R 2, Sq 2)]
> test = outer muPrior sigmaPrior bigH bigSigmaY
> a bigA bigSigmaX (map (fromList . return) samples)