Stochastic Integration

Introduction

Suppose we wish to model a process described by a differential equation and initial condition

\displaystyle   \begin{aligned}  \dot{x}(t) &= a(x, t) \\  x(0) &= a_0  \end{aligned}

But we wish to do this in the presence of noise. It’s not clear how do to this but maybe we can model the process discretely, add noise and somehow take limits.

Let \pi = \{0 = t_0 \leq t_1 \leq \ldots \leq t_n = t\} be a partition of [0, t] then we can discretise the above, allow the state to be random and add in some noise which we model as samples of Brownian motion at the selected times multiplied by b so that we can vary the amount noise depending on the state. We change the notation from x to X(\omega) to indicate that the variable is now random over some probability space.

\displaystyle   \begin{aligned}  {X}(t_{i+1}, \omega) - {X}(t_i, \omega)  &= a({X}(t_i, \omega))(t_{i+1} - t_i) +                                              b({X}(t_i, \omega))(W(t_{i+1}, \omega) - W(t_i, \omega)) \\  X(t_0, \omega) &= A_{0}(\omega)  \end{aligned}

We can suppress explicit mention of \omega and use subscripts to avoid clutter.

\displaystyle   \begin{aligned}  {X}_{t_{i+1}} - {X}_{t_i}  &= a({X}_{t_i})(t_{i+1} - t_i) +                                b({X}_{t_i})(W_{t_{i+1}} - W_{t_i}) \\  X(t_0) &= A_{0}(\omega)  \end{aligned}

We can make this depend continuously on time specifying that

\displaystyle   X_t = X_{t_i} \quad \mathrm{for} \, t \in (t_i, t_{i+1}]

and then telescoping to obtain

\displaystyle   \begin{aligned}  {X}_{t} &= X_{t_0} + \sum_{i=0}^{k-1} a({X}_{t_i})(t_{i+1} - t_i) +                       \sum_{i=0}^{k-1} b({X}_{t_i})(W_{t_{i+1}} - W_{t_i})                       \quad \mathrm{for} \, t \in (t_k, t_{k+1}]  \end{aligned}

In the limit, the second term on the right looks like an ordinary integral with respect to time albeit the integrand is stochastic but what are we to make of the the third term? We know that Brownian motion is nowhere differentiable so it would seem the task is impossible. However, let us see what progress we can make with so-called simple proceses.

Simple Processes

Let

\displaystyle   X(t,\omega) = \sum_{i=0}^{k-1} B_i(\omega)\mathbb{I}_{(t_i, t_{i+1}]}(t)

where B_i is {\cal{F}}(t_i)-measurable. We call such a process simple. We can then define

\displaystyle   \int_0^\infty X_s \mathrm{d}W_s \triangleq \sum_{i=0}^{k-1} B_i{(W_{t_{i+1}} - W_{t_{i+1}})}

So if we can produce a sequence of simple processes, X_n that converge in some norm to X then we can define

\displaystyle   \int_0^\infty X(s)\mathrm{d}W(s) \triangleq \lim_{n \to \infty}\int_0^\infty X_n(s)\mathrm{d}W(s)

Of course we need to put some conditions of the particular class of stochastic processes for which this is possible and check that the limit exists and is unique.

We consider the {\cal{L}}^2(\mu \times \mathbb{P}), the space of square integrable functions with respect to the product measure \mu \otimes \mathbb{P} where \mu is Lesbegue measure on {\mathbb{R}^+} and \mathbb{P} is some given probability measure. We further restrict ourselves to progressively measurable functions. More explicitly, we consider the latter class of stochastic processes such that

\displaystyle   \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s < \infty

Less Simple Processes

Bounded, Almost Surely Continuous and Progressively Adapted

Let X be a bounded, almost surely continuous and progressively measurable process which is (almost surely) 0 for t > T for some positive constant T. Define

\displaystyle   X_n(t, \omega) \triangleq X\bigg(T\frac{i}{n}, \omega\bigg) \quad \mathrm{for} \quad T\frac{i}{n} \leq t < T\frac{i + 1}{n}

These processes are cleary progressively measurable and by bounded convergence (X is bounded by hypothesis and \{X_n\}_{n=0,\ldots} is uniformly bounded by the same bound).

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

Bounded and Progressively Measurable

Let X be a bounded and progressively measurable process which is (almost surely) 0 for t > T for some positive constant T. Define

\displaystyle   X_n(t, \omega) \triangleq \frac{1}{1/n}\int_{t-1/n}^t X(s, \omega) \,\mathrm{d}s

Then X^n(s, \omega) is bounded, continuous and progressively measurable and it is well known that X^n(t, \omega) \rightarrow X(t, \omega) as n \rightarrow 0. Again by bounded convergence

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

Progressively Measurable

Firstly, let X be a progressively measurable process which is (almost surely) 0 for t > T for some positive constant T. Define X_n(t, \omega) = X(t, \omega) \land n. Then X_n is bounded and by dominated convergence

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

Finally let X be a progressively measurable process. Define

\displaystyle   X_n(t, \omega) \triangleq  \begin{cases}  X(t, \omega) & \text{if } t \leq n \\  0            & \text{if } \mathrm{otherwise}  \end{cases}

Clearly

\displaystyle   \lim_{n \to \infty}\|X - X_n\|_2 = 0

The Itô Isometry

Let X be a simple process such that

\displaystyle   \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s < \infty

then

\displaystyle   \mathbb{E}\bigg(\int_0^\infty X_s\,\mathrm{d}W_s\bigg)^2 =  \mathbb{E}\bigg(\sum_{i=0}^{k-1} B_i{(W_{t_{i+1}} - W_{t_{i}})}\bigg)^2 =  \sum_{i=0}^{k-1} \mathbb{E}(B_i)^2({t_{i+1}} - {t_{i}}) =  \mathbb{E}\int_0^\infty X^2_s\,\mathrm{d}s

Now suppose that \{H_n\}_{n \in \mathbb{N}} is a Cauchy sequence of progressively measurable simple functions in {\cal{L}}^2(\mu \times \mathbb{P}) then since the difference of two simple processes is again a simple process we can apply the Itô Isometry to deduce that

\displaystyle   \lim_{m,n \to \infty}\mathbb{E}\bigg(\int_0^\infty (H_n(s) - H_m(s))\,\mathrm{d}W(s)\bigg)^2 =  \lim_{m,n \to \infty}\mathbb{E}\int_0^\infty (H_n(s) - H_m(s))^2\,\mathrm{d}s =  0

In other words, \int_0^\infty H_n(s)\,\mathrm{d}W(s) is also Cauchy in {\cal{L}}^2(\mathbb{P}) and since this is complete, we can conclude that

\displaystyle   \int_0^\infty X(s)\mathrm{d}W(s) \triangleq \lim_{n \to \infty}\int_0^\infty X_n(s)\mathrm{d}W(s)

exists (in {\cal{L}}^2(\mathbb{P})). Uniqueness follows using the triangle inequality and the Itô isometry.

Notes

  1. We defer proving the definition also makes sense almost surely to another blog post.

  2. This approach seems fairly standard see for example Handel (2007) and Mörters et al. (2010).

  3. Rogers and Williams (2000) takes a more general approach.

  4. Protter (2004) takes a different approach by defining stochastic processes which are good integrators, a more abstract motivation than the one we give here.

  5. The requirement of progressive measurability can be relaxed.

Bibliography

Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”

Mörters, P, Y Peres, O Schramm, and W Werner. 2010. Brownian motion. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=e-TbA-dSrzYC.

Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.

Rogers, L.C.G., and D. Williams. 2000. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus. Cambridge Mathematical Library. Cambridge University Press. https://books.google.co.uk/books?id=bDQy-zoHWfcC.

Some Background on Hidden Markov Models

Introduction

If you look at the wikipedia article on Hidden Markov Models (HMMs) then you might be forgiven for concluding that these deal only with discrete time and finite state spaces. In fact, HMMs are much more general. Furthermore, a better understanding of such models can be helped by putting them into context. Before actually specifying what an HMM is, let us review something of Markov processes. A subsequent blog post will cover HMMs themselves.

Markov Process and Chains

Recall that a transition kernel is a mapping \mu : X \times {\cal{Y}} \rightarrow \overline{\mathbb{R}}_{+} where (X, {\cal{X}}) and (Y, {\cal{Y}}) are two measurable spaces such that \mu(s, \cdot) is a probability measure on {\cal{Y}} for all s \in X and such that \mu(\cdot, A) is a measurable function on X for all A \in {\cal{Y}}.

For example, we could have X = Y = \{a, b\} and {\cal{X}} = {\cal{Y}} = \{\emptyset, \{a\}, \{b\}, \{a,b\}\} and \mu(a,\{a\}) = 0.4, \mu(a,\{b\}) = 0.6, \mu(b,\{a\}) = 0.6, \mu(b,\{b\}) = 0.4. Hopefully this should remind you of the transition matrix of a Markov chain.

Recall further that a family of such transitions \{\mu_t\}_{t \in T} where T is some index set satisfying

\displaystyle  \mu_{t+s}(x, A) = \int_{Y} \mu_s(x, {\mathrm{d}y})\mu_t(y, A)

gives rise to a Markov process (under some mild conditions — see Rogers and Williams (2000) and Kallenberg (2002) for much more detail), that is, a process in which what happens next only depends on where the process is now and not how it got there.

Let us carry on with our example and take T = \mathbb{N}. With a slight abuse of notation and since Y is finite we can re-write the integral as a sum

\displaystyle  \mu_{n+m}(x, z) = \sum_{y \in Y} \mu_m(x, y)\mu_n(y, z)

which we recognise as a restatement of how Markov transition matrices combine.

Some Examples

A Fully Deterministic System

A deterministic system can be formulated as a Markov process with a particularly simple transition kernel given by

\displaystyle  \mu_t(x_s, A) = \delta(f_t(x_s), A) \triangleq  \begin{cases}  1           & \text{if } f_t(x_s) \in    A \\  0           & \text{if } f_t(x_s) \notin A  \end{cases}

where f_t is the deterministic state update function (the flow) and \delta is the Dirac delta function.

Parameters

Let us suppose that the determinstic system is dependent on some time-varying values for which we we are unable or unwish to specify a deterministic model. For example, we may be considering predator-prey model where the parameters cannot explain every aspect. We could augment the deterministic kernel in the previous example with

\displaystyle  \mu_t(\theta_s, {\mathrm{d}\phi}) =  {{\cal{N}} \left( {\mathrm{d}\phi} \,\vert\, \theta_s, \tau^2(t-s) \right)}

where we use Greek letters for the parameters (and Roman letters for state) and we use e.g. {\mathrm{d}\phi} to indicate probability densities. In other words that the parameters tend to wiggle around like Brown’s pollen particles rather than remaining absolutely fixed.

Ornstein-Uhlenbeck

Of course Brownian motion or diffusion may not be a good model for our parameters; with Brownian motion, the parameters could drift off to \pm\infty. We might believe that our parameters tend to stay close to some given value (mean-reverting) and use the Ornstein-Uhlenbeck kernel.

\displaystyle   \mu_t(\theta_s, {\mathrm{d}\phi}) =  {{\cal{N}} \left( {\mathrm{d}\phi} \,\vert\, \alpha + (\theta_s - \alpha)e^{-\beta t},\frac{\sigma^2}{2\beta}\big(1 - e^{-2\beta t}\big) \right)}

where \beta expresses how strongly we expect the parameter to respond to perturbations, \alpha is the mean to which the process wants to revert (aka the asymptotic mean) and \sigma^2 expresses how noisy the process is.

It is sometimes easier to view these transition kernels in terms of stochastic differential equations. Brownian motion can be expressed as

\displaystyle   \mathrm{d}X_t = \sigma\mathrm{d}W_t

and Ornstein-Uhlenbeck can be expressed as

\displaystyle   \mathrm{d}X_t = -\beta(X_t - \alpha)\mathrm{d}t + \sigma\mathrm{d}W_t

where W_t is the Wiener process.

Let us check that the latter stochastic differential equation gives the stated kernel. Re-writing it in integral form and without loss of generality taking s= 0

\displaystyle   X_t = \alpha + (x_0 - \alpha)e^{-\beta t} + \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s

Since the integral is of a deterministic function, the distribution of X_t is normal. Thus we need only calculate the mean and variance.

The mean is straightforward.

\displaystyle   \mathbb{E}[X_t \,\vert\, X_0 = x_0] =  \mathbb{E}\Bigg[\alpha + (x_0 - \alpha)e^{-\beta t} + \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s \,\vert\, X_0 = x_0\Bigg] =  \alpha + (x_0 - \alpha)e^{-\beta t}

Without loss of generality assume t \leq u and writing \mathbb{C} for covariance

\displaystyle   \begin{aligned}  \mathbb{C}[X_u, X_t \,\vert\, X_0 = x_0] &=  \mathbb{E}\Bigg[\Bigg( \sigma\int_0^u e^{-\beta(u - s)}\mathrm{d}W_s  \Bigg)                  \Bigg( \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s  \Bigg)\Bigg] \\  &=  \sigma^2e^{-\beta(u + t)}  \mathbb{E}\Bigg[\Bigg(\int_0^u e^{\beta s}\mathrm{d}W_s\Bigg)            \Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s\Bigg)\Bigg] \\  &=  \sigma^2e^{-\beta(u + t)}  \mathbb{E}\Bigg[\Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s + \int_t^u e^{\beta s}\mathrm{d}W_s\Bigg)            \Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s\Bigg)\Bigg]  \end{aligned}

And now we can use Ito and independence

\displaystyle   \begin{aligned}  \mathbb{C}[X_u, X_t \,\vert\, X_0 = x_0] &=  \sigma^2e^{-\beta(u + t)}\mathbb{E}\Bigg[ \int_0^t e^{2\beta s}\mathrm{d}s  \Bigg] \\  &=  \frac{\sigma^2e^{-\beta(u + t)}}{2\beta}\big(e^{2\beta t} - 1\big)  \end{aligned}

Substituting in t = u gives the desired result.

Bibliography

Kallenberg, O. 2002. Foundations of Modern Probability. Probability and Its Applications. Springer New York. http://books.google.co.uk/books?id=TBgFslMy8V4C.

Rogers, L. C. G., and David Williams. 2000. Diffusions, Markov Processes, and Martingales. Vol. 1. Cambridge Mathematical Library. Cambridge: Cambridge University Press.

Population Growth Estimation via Hamiltonian Monte Carlo

Here’s the same analysis of estimating population growth using Stan.

data {
  int<lower=0> N; // number of observations
  vector[N] y;    // observed population
}

parameters {
  real r;
}

model {
  real k;
  real p0;
  real deltaT;
  real sigma;
  real mu0;
  real sigma0;
  vector[N] p;
  k      <- 1.0;
  p0     <- 0.1;
  deltaT <- 0.0005;
  sigma  <- 0.01;
  mu0    <- 5;
  sigma0 <- 10;

  r ~ normal(mu0, sigma0);

  for (n in 1:N) {
    p[n] <- k * p0 * exp((n - 1) * r * deltaT) / (k + p0 * (exp((n - 1) * r * deltaT) - 1));
    y[n] ~ normal(p[n], sigma);
  }
}

Empirically, by looking at the posterior, this seems to do a better job than either extended Kalman or vanilla Metropolis.

Population Growth Estimation via Markov Chain Monte Carlo

Introduction

Let us see if we can estimate the parameter for population growth using MCMC in the example in which we used Kalman filtering.

We recall the model.

\displaystyle   \begin{aligned}  \dot{p} & =  rp\Big(1 - \frac{p}{k}\Big)  \end{aligned}

\displaystyle   p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}

And we are allowed to sample at regular intervals

\displaystyle   \begin{aligned}  p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\  y_i &= p_i + \epsilon_i  \end{aligned}

In other words y_i \sim {\cal{N}}(p_i, \sigma^2), where \sigma is known so the likelihood is

\displaystyle   p(y\,|\,r) \propto \prod_{i=1}^n \exp{\bigg( -\frac{(y_i - p_i)^2}{2\sigma^2}\bigg)} =  \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}

Let us assume a prior of r \sim {\cal{N}}(\mu_0,\sigma_0^2) then the posterior becomes

\displaystyle   p(r\,|\,y) \propto \exp{\bigg( -\frac{(r - \mu_0)^2}{2\sigma_0^2} \bigg)} \exp{\bigg( -\sum_{i=1}^n \frac{(y_i - p_i)^2}{2\sigma^2}\bigg)}

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module PopGrowthMCMC where
> 
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )

Implementation

We assume most of the parameters are known with the exception of the the growth rate r. We fix this also in order to generate test data.

> k, p0 :: Double
> k = 1.0
> p0 = 0.1
> r, deltaT :: Double
> r = 10.0
> deltaT = 0.0005
> nObs :: Int
> nObs = 150

Here’s the implementation of the logistic function

> logit :: Double -> Double -> Double -> Double
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))

Let us create some noisy data.

> sigma :: Double
> sigma = 1e-2
> samples :: [Double]
> samples = zipWith (+) mus epsilons
>   where
>     mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
>     epsilons = evalState (sample $ replicateM nObs $ rvar (Normal 0.0 sigma)) (pureMT 3)

Arbitrarily let us set the prior to have a rather vague normal distribution.

> mu0, sigma0 :: Double
> mu0 = 5.0
> sigma0 = 1e1
> prior :: Double -> Double
> prior r = exp (-(r - mu0)**2 / (2 * sigma0**2))
> likelihood :: Double -> [Double] -> Double
> likelihood r ys = exp (-sum (zipWith (\y mu -> (y - mu)**2 / (2 * sigma**2)) ys mus))
>   where
>     mus :: [Double]
>     mus = map (logit p0 k . (* (r * deltaT))) (map fromIntegral [0..])
> posterior :: Double -> [Double] -> Double
> posterior r ys = likelihood r ys * prior r

The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.

> acceptanceProb' :: Double -> Double -> [Double] -> Double
> acceptanceProb' r r' ys = min 1.0 ((posterior r' ys) / (posterior r ys))
> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (r, nAccs) (proposedJump, acceptOrReject) =
>   if acceptOrReject < acceptanceProb' r (r + proposedJump) samples
>   then (r + proposedJump, nAccs + 1)
>   else (r, nAccs)

Here are our proposals.

> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
>   evalState (replicateM nIters (sample (Normal 0.0 sigma)))
>   (pureMT $ fromIntegral seed)

We also need samples from the uniform distribution

> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
>   evalState (replicateM nIters (sample stdUniform))
>   (pureMT $ fromIntegral seed)

Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning.

> nIters, burnIn :: Int
> nIters = 100000
> burnIn = nIters `div` 10

Let us start our chain to the mean of the prior. In theory this shoudn’t matter as by the time we have burnt in we should be sampling in the high density region of the distribution.

> startMu :: Double
> startMu = 5.0

This jump size should allow us to sample the region of high density at a reasonable granularity.

> jumpVar :: Double
> jumpVar = 0.01

Now we can test our MCMC implementation.

> test :: Int -> [(Double, Int)]
> test seed =
>   drop burnIn $
>   scanl oneStep (startMu, 0) $
>   zip (normalisedProposals seed jumpVar nIters)
>       (acceptOrRejects seed nIters)

We put the data into a histogram.

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = r - 0.5 * sigma0
>     upper = r + 0.5 * sigma0
> 
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $ test seed)

With 50 observations we don’t seem to be very certain about the growth rate.

With 100 observations we do very much better.

And with 150 observations we do even better.

The Flow of the Thames: An Autoregressive Model

Thames Flux

It is roughly 150 miles from the source of the Thames to Kingston Bridge. If we assume that it flows at about 2 miles per hour then the water at Thames Head will have reached Kingston very roughly at \frac{150}{24\times 2} \approxeq 3 days.

The Environmental Agency measure the flux at Kingston Bridge on a twice daily basis. Can we predict this? In the first instance without any other data and using our observation that Thames flushes itself every 3 days, let us try

\displaystyle   X_t = \theta_1 X_{t-1} + \theta_2 X_{t-2} + \theta_3 X_{t-3} + \epsilon_t

where X_t is the flux on day t and \{\epsilon_t\}_{t \in \mathbb{N}} are independent normal errors with mean 0 and variance some given value \sigma^2.

Kalman

As it stands, our model is not Markov so we cannot directly apply techniques such as Kalman filtering or particle filtering to estimate the parameters. However we can re-write the model as

\displaystyle   \begin{bmatrix}  \theta_1^{(t)} \\  \theta_2^{(t)} \\  \theta_3^{(t)}  \end{bmatrix} =  \begin{bmatrix}  1 & 0 & 0 \\  0 & 1 & 0 \\  0 & 0 & 1  \end{bmatrix}  \begin{bmatrix}  \theta_1^{(t-1)} \\  \theta_2^{(t-1)} \\  \theta_3^{(t-1)}  \end{bmatrix} +  \begin{bmatrix}  \eta_{t} \\  \eta_{t} \\  \eta_{t}  \end{bmatrix}

\displaystyle   y_t = \begin{bmatrix}        x_{t-1} & x_{t-2} & x_{t-3}        \end{bmatrix}  \begin{bmatrix}  \theta_1^{(t)} \\  \theta_2^{(t)} \\  \theta_3^{(t)}  \end{bmatrix} +  \epsilon_{t}

Note that the observation map now varies over time so we have modify our Kalman filter implementation to accept a different matrix at each step.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE TypeFamilies                 #-}
> module Autoregression (
>     predictions
>   ) where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import qualified Data.Vector as V
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $ linSolve m eye
> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n -> [L m n] -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer muPrior sigmaPrior bigHs bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) (zip ys bigHs)
> 
>     update :: (R n, Sq n) -> (R m, L m n) -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) (y, bigH) =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         v :: R m
>         v = y - bigH #> xHatFlat
>         bigS :: Sq m
>         bigS = bigH  bigSigmaHatFlat  (tr bigH) + bigSigmaY
>         bigK :: L n m
>         bigK = bigSigmaHatFlat  (tr bigH)  (inv bigS)
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK  bigS  (tr bigK)
>         xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA  bigSigmaHat  (tr bigA) + bigSigmaX

We can now set up the parameters to run the filter.

> stateVariance :: Double
> stateVariance = 1e-8
> bigSigmaX :: Sq 3
> bigSigmaX = fromList [ stateVariance, 0.0,           0.0
>                      , 0.0,           stateVariance, 0.0
>                      , 0.0,           0.0,           stateVariance
>                      ]
> bigA :: Sq 3
> bigA = eye
> muPrior :: R 3
> muPrior = fromList [0.0, 0.0, 0.0]
> sigmaPrior :: Sq 3
> sigmaPrior = fromList [ 1e1, 0.0, 0.0
>                       , 0.0, 1e1, 0.0
>                       , 0.0, 0.0, 1e1
>                       ]
> bigHsBuilder :: V.Vector Double -> [L 1 3]
> bigHsBuilder flows =
>   V.toList $
>   V.zipWith3 (\x0 x1 x2 -> fromList [x0, x1, x2])
>   (V.tail flows) (V.tail $ V.tail flows) (V.tail $ V.tail $ V.tail flows)
> obsVariance :: Double
> obsVariance = 1.0e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [ obsVariance ]
> predict :: R 3 -> Double -> Double -> Double -> Double
> predict theta f1 f2 f3 = h1 * f1 + h2 * f2 + h3 * f3
>   where
>     (h1, t1) = headTail theta
>     (h2, t2) = headTail t1
>     (h3, _)  = headTail t2
> thetas :: V.Vector Double -> [(R 3, Sq 3)]
> thetas flows = outer muPrior sigmaPrior (bigHsBuilder flows)
>                bigSigmaY bigA bigSigmaX (map (fromList . return) (V.toList flows))
> predictions :: V.Vector Double -> V.Vector Double
> predictions flows =
>   V.zipWith4 predict
>   (V.fromList $ map fst (thetas flows))
>   flows (V.tail flows) (V.tail $ V.tail flows)

How Good is Our Model?

If we assume that parameters are essentially fixed by taking the state variance to be e.g. 10^{-8} then the fit is not good.

However, if we assume the parameters to undergo Brownian motion by taking the state variance to be e.g. 10^{-2} then we get a much better fit. Of course, Brownian motion is probably not a good way of modelling the parameters; we hardly expect that these could wander off to infinity.

Rejection Sampling

Introduction

Suppose you want to sample from the truncated normal distribution. One way to do this is to use rejection sampling. But if you do this naïvely then you will run into performance problems. The excellent Devroye (1986) who references Marsaglia (1964) gives an efficient rejection sampling scheme using the Rayleigh distribution. The random-fu package uses the Exponential distribution.

Performance

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE FlexibleContexts             #-}
> import Control.Monad
> import Data.Random
> import qualified Data.Random.Distribution.Normal as N
> import Data.Random.Source.PureMT
> import Control.Monad.State

Here’s the naïve implementation.

> naiveReject :: Double -> RVar Double
> naiveReject x = doit
>   where
>     doit = do
>       y <- N.stdNormal
>       if y < x
>         then doit
>         else return y

And here’s an implementation using random-fu.

> expReject :: Double -> RVar Double
> expReject x = N.normalTail x

Let’s try running both of them

> n :: Int
> n = 10000000
> lower :: Double
> lower = 2.0
> testExp :: [Double]
> testExp = evalState (replicateM n $ sample (expReject lower)) (pureMT 3)
> testNaive :: [Double]
> testNaive = evalState (replicateM n $ sample (naiveReject lower)) (pureMT 3)
> main :: IO ()
> main = do
>   print $ sum testExp
>   print $ sum testNaive

Let’s try building and running both the naïve and better tuned versions.

ghc -O2 CompareRejects.hs

As we can see below we get 59.98s and 4.28s, a compelling reason to use the tuned version. And the difference in performance will get worse the less of the tail we wish to sample from.

Tuned

2.3731610476911187e7
  11,934,195,432 bytes allocated in the heap
       5,257,328 bytes copied during GC
          44,312 bytes maximum residency (2 sample(s))
          21,224 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23145 colls,     0 par    0.09s    0.11s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    4.19s  (  4.26s elapsed)
  GC      time    0.09s  (  0.11s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.28s  (  4.37s elapsed)

  %GC     time       2.2%  (2.6% elapsed)

  Alloc rate    2,851,397,967 bytes per MUT second

  Productivity  97.8% of total user, 95.7% of total elapsed

Naïve

2.3732073159369867e7
 260,450,762,656 bytes allocated in the heap
     111,891,960 bytes copied during GC
          85,536 bytes maximum residency (2 sample(s))
          76,112 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     512768 colls,     0 par    1.86s    2.24s     0.0000s    0.0008s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   58.12s  ( 58.99s elapsed)
  GC      time    1.86s  (  2.24s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    59.98s  ( 61.23s elapsed)

  %GC     time       3.1%  (3.7% elapsed)

  Alloc rate    4,481,408,869 bytes per MUT second

  Productivity  96.9% of total user, 94.9% of total elapsed

Bibliography

Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag. http://books.google.co.uk/books?id=mEw\_AQAAIAAJ.

Marsaglia, G. 1964. “Generating a Variable from the Tail of the Normal Distribution.” J-Technometrics 6 (1): 101–2.

Stochastic Volatility

Introduction

Simple models for e.g. financial option pricing assume that the volatility of an index or a stock is constant, see here for example. However, simple observation of time series show that this is not the case; if it were then the log returns would be white noise

One approach which addresses this, GARCH (Generalised AutoRegressive Conditional Heteroskedasticity), models the evolution of volatility deterministically.

Stochastic volatility models treat the volatility of a return on an asset, such as an option to buy a security, as a Hidden Markov Model (HMM). Typically, the observable data consist of noisy mean-corrected returns on an underlying asset at equally spaced time points.

There is evidence that Stochastic Volatility models (Kim, Shephard, and Chib (1998)) offer increased flexibility over the GARCH family, e.g. see Geweke (1994), Fridman and Harris (1998) and Jacquier, Polson, and Rossi (1994). Despite this and judging by the numbers of questions on the R Special Interest Group on Finance mailing list, the use of GARCH in practice far outweighs that of Stochastic Volatility. Reasons cited are the multiplicity of estimation methods for the latter and the lack of packages (but see here for a recent improvement to the paucity of packages).

In their tutorial on particle filtering, Doucet and Johansen (2011) give an example of stochastic volatility. We save this approach for future blog posts and follow Lopes and Polson and the excellent lecture notes by Hedibert Lopes.

Here’s the model.

\displaystyle   \begin{aligned}  H_0     &\sim {\mathcal{N}}\left( m_0, C_0\right) \\  H_t     &= \mu + \phi H_{t-1} + \tau \eta_t \\   Y_n     &= \beta \exp(H_t / 2) \epsilon_n \\  \end{aligned}

We wish to estimate \mu, \phi, \tau and \boldsymbol{h}. To do this via a Gibbs sampler we need to sample from

\displaystyle   {p \left( \mu, \phi, \tau \,\vert\, \boldsymbol{h}, \boldsymbol{y} \right)} \quad \text{and} \quad  {p \left( \boldsymbol{h} \,\vert\, \mu, \phi, \tau, \boldsymbol{y} \right)}

Haskell Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE RecursiveDo                   #-}
> {-# LANGUAGE ExplicitForAll                #-}
> {-# LANGUAGE TypeOperators                 #-}
> {-# LANGUAGE TypeFamilies                  #-}
> {-# LANGUAGE ScopedTypeVariables           #-}
> {-# LANGUAGE DataKinds                     #-}
> {-# LANGUAGE FlexibleContexts              #-}
> module StochVol (
>     bigM
>   , bigM0
>   , runMC
>   , ys
>   , vols
>   , expectationTau2
>   , varianceTau2
>   ) where
> import Numeric.LinearAlgebra.HMatrix hiding ( (===), (|||), Element,
>                                               (<>), (#>), inv )
> import qualified Numeric.LinearAlgebra.Static as S
> import Numeric.LinearAlgebra.Static ( (<>) )
> import GHC.TypeLits
> import Data.Proxy
> import Data.Maybe ( fromJust )
> import Data.Random
> import Data.Random.Source.PureMT
> import Control.Monad.Fix
> import Control.Monad.State.Lazy
> import Control.Monad.Writer hiding ( (<>) )
> import Control.Monad.Loops
> import Control.Applicative
> import qualified Data.Vector as V
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => S.Sq n -> S.Sq n
> inv m = fromJust $ S.linSolve m S.eye
> infixr 8 #>
> (#>) :: (KnownNat m, KnownNat n) => S.L m n -> S.R n -> S.R m
> (#>) = (S.#>)
> type StatsM a = RVarT (Writer [((Double, Double), Double)]) a
> (|||) :: (KnownNat ((+) r1 r2), KnownNat r2, KnownNat c, KnownNat r1) =>
>          S.L c r1 -> S.L c r2 -> S.L c ((+) r1 r2)
> (|||) = (S.¦)

Marginal Distribution for Parameters

Let us take a prior that is standard for linear regression

\displaystyle   (\boldsymbol{\theta}, \tau^2) \sim {\mathcal{NIG}}(\boldsymbol{\theta}_0, V_0, \nu_0, s_0^2)

where \boldsymbol{\theta} = (\mu, \phi)^\top and use standard results for linear regression to obtain the required marginal distribution.

That the prior is Normal Inverse Gamma ({\cal{NIG}}) means

\displaystyle   \begin{aligned}  \boldsymbol{\theta} \, | \, \tau^2 & \sim {\cal{N}}(\boldsymbol{\theta}_0, \tau^2 V_0) \\  \tau^2                             & \sim {\cal{IG}}(\nu_0 / 2, \nu_0 s_0^2 / 2)  \end{aligned}

Standard Bayesian analysis for regression tells us that the (conditional) posterior distribution for

\displaystyle   y_i = \beta + \alpha x_i + \epsilon_i

where the \{\epsilon_i\} are IID normal with variance \sigma^2 is given by

\displaystyle   {p \left( \alpha, \beta, \eta \,\vert\, \boldsymbol{y}, \boldsymbol{x} \right)} =  {\cal{N}}((\alpha, \beta); \mu_n, \sigma^2\Lambda_n^{-1})\,{\cal{IG}}(a_n, b_n)

with

\displaystyle   \Lambda_n = X_n^\top X_n  + \Lambda_0

\displaystyle   \begin{matrix}  \mu_n = \Lambda_n^{-1}({X_n}^{\top}{X_n}\hat{\boldsymbol{\beta}}_n + \Lambda_0\mu_0) &  \textrm{where} &  \hat{\boldsymbol\beta}_n = ({X}_n^{\rm T}{X}_n)^{-1}{X}_n^{\rm T}\boldsymbol{y}_n  \end{matrix}

\displaystyle   \begin{matrix}  a_n = \frac{n}{2} + a_0 & \quad &  b_n = b_0 +        \frac{1}{2}(\boldsymbol{y}^\top\boldsymbol{y} +                    \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 -                    \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n)  \end{matrix}

Recursive Form

We can re-write the above recursively. We do not need to for this blog article but it will be required in any future blog article which uses Sequential Monte Carlo techniques.

\displaystyle   \Lambda_n = \boldsymbol{x}_n^\top \boldsymbol{x}_n  + \Lambda_{n-1}

Furthermore

\displaystyle   \Lambda_{n}\mu_{n} =  {X}_{n}^{\rm T}\boldsymbol{y}_{n} + \Lambda_0\mu_0 =  {X}_{n-1}^{\rm T}\boldsymbol{y}_{n-1} + \boldsymbol{x}_n^\top y_n + \Lambda_0\mu_0 =  \Lambda_{n-1}\mu_{n-1} + \boldsymbol{x}_n^\top y_n

so we can write

\displaystyle   \boldsymbol{\mu}_n = \Lambda_n^{-1}(\Lambda_{n-1}\mu_{n-1} + \boldsymbol{x}_n^\top y_n)

and

\displaystyle   \begin{matrix}  a_n = a_{n-1} + \frac{1}{2} & \quad &  b_n = b_{n-1} + \frac{1}{2}\big[(y_n - \boldsymbol{\mu}_n^\top \boldsymbol{x}_n)y_n + (\boldsymbol{\mu}_{n-1} - \boldsymbol{\mu}_{n})^\top \Lambda_{n-1}\boldsymbol{\mu}_{n-1}\big]  \end{matrix}

Specialising

In the case of our model we can specialise the non-recursive equations as

\displaystyle   \Lambda_n = \begin{bmatrix} 1 & 1 & \ldots & 1 \\                              x_1 & x_2 & \ldots & x_n              \end{bmatrix}              \begin{bmatrix} 1 & x_1 \\                              1 & x_2 \\                              \ldots & \ldots \\                              1 & x_n              \end{bmatrix}  + \Lambda_0

\displaystyle   \begin{matrix}  \mu_n = (\Lambda_n)^{-1}({X_n}^{\top}{X_n}\hat{\boldsymbol{\beta}}_n + \Lambda_0\mu_0) &  \textrm{where} &  \hat{\boldsymbol\beta}_n = ({X}_n^{\rm T}{X}_n)^{-1}{X}_n^{\rm T}\boldsymbol{x}_{1:n}  \end{matrix}

\displaystyle   \begin{matrix}  a_n = \frac{n}{2} + a_0 & \quad &  b_n = b_0 +        \frac{1}{2}(\boldsymbol{x}_{1:n}^\top\boldsymbol{x}_{1:n} +                    \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 -                    \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n)  \end{matrix}

Let’s re-write the notation to fit our model.

\displaystyle   \Lambda_n = \begin{bmatrix} 1 & 1 & \ldots & 1 \\                              h_1 & h_2 & \ldots & h_n              \end{bmatrix}              \begin{bmatrix} 1 & h_1 \\                              1 & h_2 \\                              \ldots & \ldots \\                              1 & h_n              \end{bmatrix}  + \Lambda_0

\displaystyle   \begin{matrix}  \mu_n = (\Lambda_n)^{-1}({H_n}^{\top}{H_n}\hat{\boldsymbol{\theta}}_n + \Lambda_0\mu_0) &  \textrm{where} &  \hat{\boldsymbol\theta}_n = ({H}_n^{\rm T}{H}_n)^{-1}{H}_n^{\rm T}\boldsymbol{x}_{1:n}  \end{matrix}

\displaystyle   \begin{matrix}  a_n = \frac{n}{2} + a_0 & \quad &  b_n = b_0 +        \frac{1}{2}(\boldsymbol{x}_{1:n}^\top\boldsymbol{x}_{1:n} +                    \boldsymbol{\mu}_0^\top\Lambda_0\boldsymbol{\mu}_0 -                    \boldsymbol{\mu}_n^\top\Lambda_n\boldsymbol{\mu}_n)  \end{matrix}

Sample from {p \left( \boldsymbol{\theta}, \tau^2 \,\vert\, \boldsymbol{h}, \boldsymbol{y} \right)} \sim {\mathcal{NIG}}(\boldsymbol{\theta}_1, V_1, \nu_1, s_1^2)

We can implement this in Haskell as

> sampleParms ::
>   forall n m .
>   (KnownNat n, (1 <=? n) ~ 'True) =>
>   S.R n -> S.L n 2 -> S.R 2 -> S.Sq 2 -> Double -> Double ->
>   RVarT m (S.R 2, Double)
> sampleParms y bigX theta_0 bigLambda_0 a_0 s_02 = do
>   let n = natVal (Proxy :: Proxy n)
>       a_n = 0.5 * (a_0 + fromIntegral n)
>       bigLambda_n = bigLambda_0 + (tr bigX) <> bigX
>       invBigLambda_n = inv bigLambda_n
>       theta_n = invBigLambda_n #> ((tr bigX) #> y + (tr bigLambda_0) #> theta_0)
>       b_0 = 0.5 * a_0 * s_02
>       b_n = b_0 +
>             0.5 * (S.extract (S.row y <> S.col y)!0!0) +
>             0.5 * (S.extract (S.row theta_0 <> bigLambda_0 <> S.col theta_0)!0!0) -
>             0.5 * (S.extract (S.row theta_n <> bigLambda_n <> S.col theta_n)!0!0)
>   g <- rvarT (Gamma a_n (recip b_n))
>   let s2 = recip g
>       invBigLambda_n' = m <> invBigLambda_n
>         where
>           m = S.diag $ S.vector (replicate 2 s2)
>   m1 <- rvarT StdNormal
>   m2 <- rvarT StdNormal
>   let theta_n' :: S.R 2
>       theta_n' = theta_n + S.chol (S.sym invBigLambda_n') #> (S.vector [m1, m2])
>   return (theta_n', s2)

Marginal Distribution for State

Marginal for H_0

Using a standard result about conjugate priors and since we have

\displaystyle   h_0 \sim {\cal{N}}(m_0,C_0) \quad h_1 \vert h_0 \sim {\cal{N}}(\mu + \phi h_0, \tau^2)

we can deduce

\displaystyle   h_0 \vert h_1 \sim {\cal{N}}(m_1,C_1)

where

\displaystyle   \begin{aligned}  \frac{1}{C_1} &= \frac{1}{C_0} + \frac{\phi^2}{\tau^2} \\  \frac{m_1}{C_1} &= \frac{m_0}{C_0} + \frac{\phi(h_1 - \mu)}{\tau^2}  \end{aligned}

> sampleH0 :: Double ->
>             Double ->
>             V.Vector Double ->
>             Double ->
>             Double ->
>             Double ->
>             RVarT m Double
> sampleH0 iC0 iC0m0 hs mu phi tau2 = do
>   let var = recip $ (iC0 + phi^2 / tau2)
>       mean = var * (iC0m0 + phi * ((hs V.! 0) - mu) / tau2)
>   rvarT (Normal mean (sqrt var))

Marginal for H_1 \ldots H_n

From the state equation, we have

\displaystyle   \begin{aligned}  H_{t+1} &=  \mu + \phi H_{t} + \tau \eta_t \\  \phi^2 H_{t} &=  -\phi\mu + \phi H_{t+1} - \phi \tau \eta_t \\  \end{aligned}

We also have

\displaystyle   \begin{aligned}  H_{t} &=  \mu + \phi H_{t-1} + \tau \eta_{t-1} \\  \end{aligned}

Adding the two expressions together gives

\displaystyle   \begin{aligned}  (1 + \phi^2)H_{t} &= \phi (H_{t-1} + H_{t+1}) + \mu (1 - \phi) + \tau(\eta_{t-1} - \phi\eta_t) \\  H_{t} &= \frac{\phi}{1 + \phi^2} (H_{t-1} + H_{t+1}) + \mu \frac{1 - \phi}{1 + \phi^2} + \frac{\tau}{1 + \phi^2}(\eta_{t-1} - \phi\eta_t) \\  \end{aligned}

Since \{\eta_t\} are standard normal, then H_t conditional on H_{t-1} and H_{t+1} is normally distributed, and

\displaystyle   \begin{aligned}  \mathbb{E}(H_n\mid H_{n-1}, H_{n+1}) &= \frac{1 - \phi}{1+\phi^2}\mu +                                          \frac{\phi}{1+\phi^2}(H_{n-1} + H_{n+1}) \\  \mathbb{V}(H_n\mid H_{n-1}, H_{n+1}) &= \frac{\tau^2}{1+\phi^2}  \end{aligned}

We also have

\displaystyle   h_{n+1} \vert h_n \sim {\cal{N}}(\mu + \phi h_n, \tau^2)

Writing

\displaystyle   \boldsymbol{h}_{-t} = \begin{bmatrix}                        h_0, &                        h_1, &                        \ldots, &                        h_{t-1}, &                        h_{t+1}, &                        \ldots, &                        h_{n-1}, &                        h_n                        \end{bmatrix}

by Bayes’ Theorem we have

\displaystyle   {p \left( h_t \,\vert\, \boldsymbol{h}_{-t}, \theta, \boldsymbol{y} \right)} \propto  {p \left( y_t \,\vert\, h_t \right)} {p \left( h_t \,\vert\, \boldsymbol{h}_{-t}, \theta, y_{-t} \right)} =  f_{\cal{N}}(y_t;0,e^{h_t}) f_{\cal{N}}(h_t;\mu_t,\nu_t^2)

where f_{\cal{N}}(x;\mu,\sigma^2) is the probability density function of a normal distribution.

We can sample from this using Metropolis

  1. For each t, sample h_t^\flat from {\cal{N}}(h_t, \gamma^2) where \gamma^2 is the tuning variance.

  2. For each t=1, \ldots, n, compute the acceptance probability

\displaystyle   p_t = \min{\Bigg(\frac{f_{\cal{N}}(h^\flat_t;\mu_t,\nu_t^2) f_{\cal{N}}(y_t;0,e^{h^\flat_t})}{f_{\cal{N}}(h_t;\mu_t,\nu_t^2) f_{\cal{N}}(y_t;0,e^{h_t})}, 1 \Bigg)}

  1. For each t, compute a new value of h_t

\displaystyle   h^\sharp_t =  \begin{cases}  h^\flat_t \text{with probability } p_t \\  h_t \text{with probability } 1 - p_t  \end{cases}

> metropolis :: V.Vector Double ->
>               Double ->
>               Double ->
>               Double ->
>               Double ->
>               V.Vector Double ->
>               Double ->
>               RVarT m (V.Vector Double)
> metropolis ys mu phi tau2 h0 hs vh = do
>   let eta2s = V.replicate (n-1) (tau2 / (1 + phi^2)) `V.snoc` tau2
>       etas  = V.map sqrt eta2s
>       coef1 = (1 - phi) / (1 + phi^2) * mu
>       coef2 = phi / (1 + phi^2)
>       mu_n  = mu + phi * (hs V.! (n-1))
>       mu_1  = coef1 + coef2 * ((hs V.! 1) + h0)
>       innerMus = V.zipWith (\hp1 hm1 -> coef1 + coef2 * (hp1 + hm1)) (V.tail (V.tail hs)) hs
>       mus = mu_1 `V.cons` innerMus `V.snoc` mu_n
>   hs' <- V.mapM (\mu -> rvarT (Normal mu vh)) hs
>   let num1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs'
>       num2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs'
>       nums  = V.zipWith (+) num1s num2s
>       den1s = V.zipWith3 (\mu eta h -> logPdf (Normal mu eta) h) mus etas hs
>       den2s = V.zipWith (\y h -> logPdf (Normal 0.0 (exp (0.5 * h))) y) ys hs
>       dens = V.zipWith (+) den1s den2s
>   us <- V.replicate n <$> rvarT StdUniform
>   let ls   = V.zipWith (\n d -> min 0.0 (n - d)) nums dens
>   return $ V.zipWith4 (\u l h h' -> if log u < l then h' else h) us ls hs hs'

Markov Chain Monte Carlo

Now we can write down a single step for our Gibbs sampler, sampling from each marginal in turn.

> singleStep :: Double -> V.Vector Double ->
>               (Double, Double, Double, Double, V.Vector Double) ->
>               StatsM (Double, Double, Double, Double, V.Vector Double)
> singleStep vh y (mu, phi, tau2, h0, h) = do
>   lift $ tell [((mu, phi),tau2)]
>   hNew <- metropolis y mu phi tau2 h0 h vh
>   h0New <- sampleH0 iC0 iC0m0 hNew mu phi tau2
>   let bigX' = (S.col $ S.vector $ replicate n 1.0)
>               |||
>               (S.col $ S.vector $ V.toList $ h0New `V.cons` V.init hNew)
>       bigX =  bigX' `asTypeOf` (snd $ valAndType nT)
>   newParms <- sampleParms (S.vector $ V.toList h) bigX (S.vector [mu0, phi0]) invBigV0 nu0 s02
>   return ( (S.extract (fst newParms))!0
>          , (S.extract (fst newParms))!1
>          , snd newParms
>          , h0New
>          , hNew
>          )

Testing

Let’s create some test data.

> mu', phi', tau2', tau' :: Double
> mu'   = -0.00645
> phi'  =  0.99
> tau2' =  0.15^2
> tau'  = sqrt tau2'

We need to create a statically typed matrix with one dimension the same size as the data so we tie the data size value to the required type.

> nT :: Proxy 500
> nT = Proxy
> valAndType :: KnownNat n => Proxy n -> (Int, S.L n 2)
> valAndType x = (fromIntegral $ natVal x, undefined)
> n :: Int
> n = fst $ valAndType nT

Arbitrarily let us start the process at

> h0 :: Double
> h0 = 0.0

We define the process as a stream (aka co-recursively) using the Haskell recursive do construct. It is not necessary to do this but streams are a natural way to think of stochastic processes.

> hs, vols, sds, ys :: V.Vector Double
> hs = V.fromList $ take n $ fst $ runState hsAux (pureMT 1)
>   where
>     hsAux :: (MonadFix m, MonadRandom m) => m [Double]
>     hsAux = mdo { x0 <- sample (Normal (mu' + phi' * h0) tau')
>                 ; xs <- mapM (\x -> sample (Normal (mu' + phi' * x) tau')) (x0:xs)
>                 ; return xs
>                 }
> vols = V.map exp hs

We can plot the volatility (which we cannot observe directly).

And we can plot the log returns.

> sds = V.map sqrt vols
> ys = fst $ runState ysAux (pureMT 2)
>   where
>     ysAux = V.mapM (\sd -> sample (Normal 0.0 sd)) sds

We start with a vague prior for H_0

> m0, c0 :: Double
> m0 = 0.0
> c0 = 100.0

For convenience

> iC0, iC0m0 :: Double
> iC0 = recip c0
> iC0m0  = iC0 * m0

Rather than really sample from priors for \mu, \phi and \tau^2 let us cheat and assume we sampled the simulated values!

> mu0, phi0, tau20 :: Double
> mu0   = -0.00645
> phi0  =  0.99
> tau20 =  0.15^2

But that we are still very uncertain about them

> bigV0, invBigV0 :: S.Sq 2
> bigV0 = S.diag $ S.fromList [100.0, 100.0]
> invBigV0 = inv bigV0
> nu0, s02 :: Double
> nu0    = 10.0
> s02    = (nu0 - 2) / nu0 * tau20

Note that for the inverse gamma this gives

> expectationTau2, varianceTau2 :: Double
> expectationTau2 = (nu0 * s02 / 2) / ((nu0 / 2) - 1)
> varianceTau2 = (nu0 * s02 / 2)^2 / (((nu0 / 2) - 1)^2 * ((nu0 / 2) - 2))
ghci> expectationTau2
  2.25e-2

ghci> varianceTau2
  1.6874999999999998e-4

Running the Markov Chain

Tuning parameter

> vh :: Double
> vh = 0.1

The burn-in and sample sizes may be too low for actual estimation but will suffice for a demonstration.

> bigM, bigM0 :: Int
> bigM0 = 2000
> bigM  = 2000
> multiStep :: StatsM (Double, Double, Double, Double, V.Vector Double)
> multiStep = iterateM_ (singleStep vh ys) (mu0, phi0, tau20, h0, vols)
> runMC :: [((Double, Double), Double)]
> runMC = take bigM $ drop bigM0 $
>         execWriter (evalStateT (sample multiStep) (pureMT 42))

And now we can look at the distributions of our estimates

Bibliography

Doucet, Arnaud, and Adam M Johansen. 2011. “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.” In Handbook of Nonlinear Filtering. Oxford, UK: Oxford University Press.

Fridman, Moshe, and Lawrence Harris. 1998. “A Maximum Likelihood Approach for Non-Gaussian Stochastic Volatility Models.” Journal of Business & Economic Statistics 16 (3): 284–91.

Geweke, John. 1994. “Bayesian Comparison of Econometric Models.”

Jacquier, Eric, Nicholas G. Polson, and Peter E. Rossi. 1994. “Bayesian Analysis of Stochastic Volatility Models.”

Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65 (3): 361–93. http://ideas.repec.org/a/bla/restud/v65y1998i3p361-93.html.

A Type Safe Reverse or Some Hasochism

Conor McBride was not joking when he and his co-author entitled their paper about dependent typing in Haskell “Hasochism”: Lindley and McBride (2013).

In trying to resurrect the Haskell package yarr, it seemed that a dependently typed reverse function needed to be written. Writing such a function turns out to be far from straightforward. How GHC determines that a proof (program) discharges a proposition (type signature) is rather opaque and perhaps not surprisingly the error messages one gets if the proof is incorrect are far from easy to interpret.

I’d like to thank all the folk on StackOverflow whose answers and comments I have used freely below. Needless to say, any errors are entirely mine.

Here are two implementations, each starting from different axioms (NB: I have never seen type families referred to as axioms but it seems helpful to think about them in this way).

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE GADTs                        #-}
> {-# LANGUAGE KindSignatures               #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE UndecidableInstances         #-}
> {-# LANGUAGE ExplicitForAll               #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE ScopedTypeVariables          #-}

For both implementations, we need propositional equality: if a :~: b is inhabited by some terminating value, then the type a is the same as the type b. Further we need the normal form of an equality proof: Refl :: a :~: a and a function, gcastWith which allows us to use internal equality (:~:) to discharge a required proof of external equality (~). Readers familiar with topos theory, for example see Lambek and Scott (1988), will note that the notation is reversed.

> import Data.Type.Equality ( (:~:) (Refl), gcastWith )

For the second of the two approaches adumbrated we will need

> import Data.Proxy

The usual natural numbers:

> data Nat = Z | S Nat

We need some axioms:

  1. A left unit and
  2. Restricted commutativity.
> type family (n :: Nat) :+ (m :: Nat) :: Nat where
>     Z   :+ m = m
>     S n :+ m = n :+ S m

We need the usual singleton for Nat to tie types and terms together.

> data SNat :: Nat -> * where
>   SZero :: SNat Z
>   SSucc :: SNat n -> SNat (S n)

Now we can prove some lemmas.

First a lemma showing we can push :+ inside a successor, S.

> succ_plus_id :: SNat n1 -> SNat n2 -> (((S n1) :+ n2) :~: (S (n1 :+ n2)))
> succ_plus_id SZero _ = Refl
> succ_plus_id (SSucc n) m = gcastWith (succ_plus_id n (SSucc m)) Refl

This looks nothing like a standard mathematical proof and it’s hard to know what ghc is doing under the covers but presumably something like this:

  • For SZero
  1. S Z :+ n2 = Z :+ S n2 (by axiom 2) = S n2 (by axiom 1) and
  2. S (Z + n2) = S n2 (by axiom 1)
  3. So S Z :+ n2 = S (Z + n2)
  • For SSucc
  1. SSucc n :: SNat (S k) so n :: SNat k and m :: SNat i so SSucc m :: SNat (S i)
  2. succ_plus id n (SSucc m) :: k ~ S p => S p :+ S i :~: S (p :+ S i) (by hypothesis)
  3. k ~ S p => S p :+ S i :~: S (S p :+ i) (by axiom 2)
  4. k :+ S i :~: S (k :+ i) (by substitution)
  5. S k :+ i :~: S (k :+ i) (by axiom 2)

Second a lemma showing that Z is also the right unit.

> plus_id_r :: SNat n -> ((n :+ Z) :~: n)
> plus_id_r SZero = Refl
> plus_id_r (SSucc n) = gcastWith (plus_id_r n) (succ_plus_id n SZero)
  • For SZero
  1. Z :+ Z = Z (by axiom 1)
  • For SSucc
  1. SSucc n :: SNat (S k) so n :: SNat k
  2. plus_id_r n :: k :+ Z :~: k (by hypothesis)
  3. succ_plus_id n SZero :: S k :+ Z :~: S (k + Z) (by the succ_plus_id lemma)
  4. succ_plus_id n SZero :: k :+ Z ~ k => S k :+ Z :~: S k (by substitution)
  5. plus_id_r n :: k :+ Z :~: k (by equation 2)

Now we can defined vectors which have their lengths encoded in their type.

> infixr 4 :::
> data Vec a n where
>     Nil   :: Vec a Z
>     (:::) :: a -> Vec a n -> Vec a (S n)

We can prove a simple result using the lemma that Z is a right unit.

> elim0 :: SNat n -> (Vec a (n :+ Z) -> Vec a n)
> elim0 n x = gcastWith (plus_id_r n) x

Armed with this we can write an {\mathcal{O}}(n) reverse function in which the length of the result is guaranteed to be the same as the length of the argument.

> size :: Vec a n -> SNat n
> size Nil         = SZero
> size (_ ::: xs)  = SSucc $ size xs
> accrev :: Vec a n -> Vec a n
> accrev x = elim0 (size x) $ go Nil x where
>     go :: Vec a m -> Vec a n -> Vec a (n :+ m)
>     go acc  Nil       = acc
>     go acc (x ::: xs) = go (x ::: acc) xs
> toList :: Vec a n -> [a]
> toList  Nil       = []
> toList (x ::: xs) = x : toList xs
> test0 :: [String]
> test0 = toList $ accrev $ "a" ::: "b" ::: "c" ::: Nil
ghci> test0
  ["c","b","a"]

For an alternative approach, let us change the axioms slightly.

> type family (n1 :: Nat) + (n2 :: Nat) :: Nat where
>   Z + n2 = n2
>   (S n1) + n2 = S (n1 + n2)

Now the proof that Z is a right unit is more straightforward.

> plus_id_r1 :: SNat n -> ((n + Z) :~: n)
> plus_id_r1 SZero = Refl
> plus_id_r1 (SSucc n) = gcastWith (plus_id_r1 n) Refl

For the lemma showing we can push + inside a successor, S, we can use a Proxy.

> plus_succ_r1 :: SNat n1 -> Proxy n2 -> ((n1 + (S n2)) :~: (S (n1 + n2)))
> plus_succ_r1 SZero _ = Refl
> plus_succ_r1 (SSucc n1) proxy_n2 = gcastWith (plus_succ_r1 n1 proxy_n2) Refl

Now we can write our reverse function without having to calculate sizes.

> accrev1 :: Vec a n -> Vec a n
> accrev1 Nil = Nil
> accrev1 list = go SZero Nil list
>   where
>     go :: SNat n1 -> Vec a n1 -> Vec a n2 -> Vec a (n1 + n2)
>     go snat acc Nil = gcastWith (plus_id_r1 snat) acc
>     go snat acc (h ::: (t :: Vec a n3)) =
>       gcastWith (plus_succ_r1 snat (Proxy :: Proxy n3))
>                 (go (SSucc snat) (h ::: acc) t)
> test1 :: [String]
> test1 = toList $ accrev1 $ "a" ::: "b" ::: "c" ::: Nil
ghci> test0
  ["c","b","a"]

Bibliography

Lambek, J., and P.J. Scott. 1988. Introduction to Higher-Order Categorical Logic. Cambridge Studies in Advanced Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=6PY_emBeGjUC.

Lindley, Sam, and Conor McBride. 2013. “Hasochism: The Pleasure and Pain of Dependently Typed Haskell Programming.” In Proceedings of the 2013 ACM SIGPLAN Symposium on Haskell, 81–92. Haskell ’13. New York, NY, USA: ACM. doi:10.1145/2503778.2503786.

Stopping Times

Let S an T be stopping times and let the filtration on which they are defined be right continuous. Then

  1. S \land T = \min(S, T),
  2. S \lor T = \max(S, T),
  3. S + T and
  4. \alpha T

are stopping times where \alpha > 1.

For the first we have \{S \land T \leq t\} = \{S \leq t\} \cup \{T \leq t\} and both the latter are in {\mathcal{F}_t} by the definition of a stopping time.

Similarly for the second \{S \lor T \leq t\} = \{S \leq t\} \cap \{T \leq t\}.

For the fourth we have \{\alpha T \leq t\} = \{T \leq \frac{t}{\alpha}\} \in {\mathcal{F}}_{t / \alpha} \subset {\mathcal{F}}_{t} since \alpha > 1.

The third is slightly trickier. For \omega \in \Omega, S(\omega) +T(\omega) < t if and only if for some rational q, we have S(\omega) + T(\omega) < q < t. We can thus we can find r \in \mathbb{Q} such that S(\omega) < r < q - T(\omega). Writing s \triangleq q - r \in \mathbb{Q} we also have T(\omega) < q - r = s. Thus we have S(\omega) + T(\omega) < t if and only if there exist r, s \in \mathbb{Q} and r, s > 0 such that r + s < t and S(\omega) < r and T(\omega) < s. In other words

\displaystyle   \{S +T < t\} = \bigcup_{r, s \in \mathbb{Q}^+} (\{S < r\} \cap \{T < s\})

By right continuity (Protter 2004 Theorem 1) of the filtration, we know the terms on the right hand side are in {\mathcal{F}}_r \subset {\mathcal{F}}_t and {\mathcal{F}}_s \subset {\mathcal{F}}_t so that the whole right hand side is in {\mathcal{F}}_t. We thus know that the left hand side is in {\mathcal{F}}_t and using right continuity again that therefore S + T must be a stopping time.

Bibliography

Protter, P.E. 2004. Stochastic Integration and Differential Equations: Version 2.1. Applications of Mathematics. Springer. http://books.google.co.uk/books?id=mJkFuqwr5xgC.