Estimating Parameters in Chaotic Systems

Post available from my new site. Sadly WordPress doesn’t allow me to render the html exported by a Jupyter notebook.


Trouble with Tribbles

Introduction

Tribbles originate from the planet Iota Geminorum IV and, according to Dr. McCoy, are born pregnant. No further details are given but we can follow Gurtin and MacCamy (1974) and perhaps recover some of what happens on the Enterprise.

Of course, age-dependent population models are of more than fictional use and can be applied, for example, to modelling the progression of Malaria in infected hosts. We roughly follow some of J. J. Thibodeaux and Schlittenhardt (2011) who themselves reference Belair, Mackey, and Mahaffy (1995).

Of interest to Haskellers are:

  • The use of the hmatrix package which now contains functions to solve tridiagonal systems used in this post. You will need to use HEAD until a new hackage / stackage release is made. My future plan is to use CUDA via accelerate and compare.

  • The use of dimensions in a medium-sized example. It would have been nice to have tried the units package but it seemed harder work to use and, as ever, “Time’s wingèd chariot” was the enemy.

The source for this post can be downloaded from github.

Age-Dependent Populations

McKendrick / von Foerster

McKendrick and von Foerster independently derived a model of age-dependent population growth.

Let n(a,t) be the density of females of age a at time t. The number of females between ages a and a + \delta a is thus n(a, t)\delta a. Assuming individuals are born at age 0, we have

\displaystyle   \frac{\partial}{\partial t}(n(a, t)\delta a) =  J(a, t) - J(a + \delta a, t) - \mu(a, t)n(a, t)\delta a

where \mu(a, t) is the death rate density and J(a, t) denotes the rate of entry to the cohort of age a. Dividing by \delta a we obtain

\displaystyle   \frac{\partial}{\partial t}n(a, t) =   - \frac{J(a + \delta a, t) - J(a, t)}{\delta a} - \mu(a, t)n(a, t)

which in the limit becomes

\displaystyle   \frac{\partial}{\partial t}n(a, t) =   - \frac{\partial J(a, t)}{\partial a} - \mu(a, t)n(a, t)

We can further assume that the rate of entry to a cohort is proportional to the density of individuals times a velocity of aging v(a, t).

\displaystyle   J(a, t) = n(a, t)v(a, t)

Occasionally there is some reason to assume that aging one year is different to experiencing one year but we further assume v = 1.

We thus obtain

\displaystyle   \frac{\partial n(a, t)}{\partial t} + \frac{\partial n(a, t)}{\partial a} =  - \mu(a, t)n(a, t)
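
As an aside of my own (not part of the original derivation), when the death rate depends only on age this first-order PDE can be solved explicitly along characteristics:

\displaystyle   n(a, t) =  \begin{cases}  n_0(a - t)\exp\big(-\int_{a-t}^{a} \mu(s)\,\mathrm{d}s\big) & a \geq t \\  n(0, t - a)\exp\big(-\int_0^{a} \mu(s)\,\mathrm{d}s\big) & a < t  \end{cases}

that is, the initial (or boundary) density is transported along lines of unit slope in the (a, t) plane and thinned by the accumulated mortality.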

Gurtin / MacCamy

To solve any PDE we need boundary and initial conditions. The number of births at time t is

\displaystyle   n(0, t) = \int_0^\infty n(a, t) m(a, N(t))\, \mathrm{d}a

where m is the natality aka birth-modulus and

\displaystyle   N(t) = \int_0^\infty n(a, t)\, \mathrm{d}a

and we further assume that the initial condition

\displaystyle   n(a, 0) = n_0(a)

for some given n_0.

Gurtin and MacCamy (1974) focus on the situation where

\displaystyle   m(a, N(t)) = \beta(N)e^{-\alpha a}

and we can also assume that the birth rate of Tribbles decreases exponentially with age and further that Tribbles can live forever. Gurtin and MacCamy (1974) then transform the PDE to obtain a pair of linked ODEs which can then be solved numerically.
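
For completeness, here is my own sketch of that reduction, assuming as Gurtin and MacCamy (1974) do that the death modulus depends only on the total population, \mu = \mu(N). Write B(t) = n(0, t) for the birth rate and S(t) = \int_0^\infty e^{-\alpha a} n(a, t)\, \mathrm{d}a, so that B = \beta(N)S. Integrating the PDE against 1 and against e^{-\alpha a} gives

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N}{\mathrm{d}t} &= \beta(N)S - \mu(N)N \\  \frac{\mathrm{d}S}{\mathrm{d}t} &= \beta(N)S - \big(\alpha + \mu(N)\big)S  \end{aligned}

which is the pair of linked ODEs referred to above.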

Of course, we know what happens on the Enterprise and rather than continue with this example, let us turn our attention to the more serious subject of Malaria.

Malaria

I realise now that I went a bit overboard with references. Hopefully they don’t interrupt the flow too much.

The World Health Organisation (WHO) estimated that in 2015 there were 214 million new cases of malaria resulting in 438,000 deaths (source: Wikipedia).

The lifecycle of the plasmodium parasite that causes malaria is extremely ingenious. J. J. Thibodeaux and Schlittenhardt (2011) model the human segment of the plasmodium lifecycle and further propose a way of determining an optimal treatment for an infected individual. Hall et al. (2013) also model the effect of an anti-malarial. Let us content ourselves with reproducing part of the paper by J. J. Thibodeaux and Schlittenhardt (2011).

At one part of its sojourn in humans, plasmodium infects erythrocytes aka red blood cells. These latter contain haemoglobin (aka hemoglobin). The process by which red blood cells are produced, Erythropoiesis, is modulated in a feedback loop by erythropoietin. The plasmodium parasite severely disrupts this process. Presumably the resulting loss of haemoglobin is one reason that an infected individual feels ill.

As can be seen in the overview by Torbett and Friedman (2009), the full feedback loop is complex. So as not to lose ourselves in the details and following J. J. Thibodeaux and Schlittenhardt (2011) and Belair, Mackey, and Mahaffy (1995), we consider a model with two compartments.

  • Precursors: prototype erythrocytes developing in the bone marrow with p(\mu, t) being the density of such cells of age \mu at time t.

  • Erythrocytes: mature red blood cells circulating in the blood with m(\nu, t) being the density of such cells of age \nu at time t.

\displaystyle   \begin{aligned}  \frac{\partial p(\mu, t)}{\partial t} + g(E(t))\frac{\partial p(\mu, t)}{\partial \mu} &=  \sigma(\mu, t, E(t))p(\mu, t) & 0 < \mu < \mu_F & & 0 < t < T \\  \frac{\partial m(\nu, t)}{\partial t} + \phantom{g(E(t))}\frac{\partial m(\nu, t)}{\partial \nu} &=  -\gamma(\nu, t, M(t))m(\nu, t) & 0 < \nu < \nu_F & &  0 < t < T  \end{aligned}

where \sigma(\mu, t, E(t)) is the birth rate of precursors, \gamma(\nu, t, M(t)) is the death rate of erythrocytes, g(E(t)) is the maturation rate of precursors and where

\displaystyle   M(t) = \int_0^{\nu_F} m(\nu, t) \,\mathrm{d}\nu

As boundary conditions, we have that the rate at which precursors mature must equal the rate at which new erythrocytes are produced

\displaystyle   m(0, t) = g(E(t))p(\mu_F, t)

and the production of new precursors depends on the level of erythropoietin

\displaystyle   g(E(t))p(0, t) = \phi(t)E(t)

where \phi(t) is some proportionality function.

As initial conditions, we have

\displaystyle   \begin{aligned}  p(\mu, 0) &= p_0(\mu) \\  m(\nu, 0) &= m_0(\nu)  \end{aligned}

We can further model the erythropoietin dynamics as

\displaystyle   \frac{\mathrm{d}E(t)}{\mathrm{d}t} = f(M(t), t) - a_E(P(t))E(t)

where f is the feedback function from the kidneys and the decay rate a_E depends on the total precursor population P(t) (Sawyer, Krantz, and Goldwasser (1987)), although it is often taken to be a constant (and I would feel more comfortable with a more recent citation), and where

\displaystyle   P(t) = \int_0^{\mu_F} p(\mu, t) \,\mathrm{d}\mu

As initial condition we have

\displaystyle   E(0) = E_0

A Finite Difference Attempt

Let us try solving the above model using a finite difference scheme observing that we currently have no basis for whether it has a solution and whether the finite difference scheme approximates such a solution! We follow J. J. Thibodeaux and Schlittenhardt (2011) who give a proof of convergence presumably with some conditions; any failure of the scheme is entirely mine.

Divide up the age and time ranges, [0, \mu_F], [0, \nu_F] and [0, T] into equal sub-intervals, [\mu_i, \mu_{i+1}], [\nu_j, \nu_{j+1}] and [t_k, t_{k+1}] where

\displaystyle   \begin{aligned}  \mu_i &= i\Delta\mu & & \mathrm{for} & i = 1 \ldots n_1 \\  \nu_j &= j\Delta\nu & & \mathrm{for} & j = 1 \ldots n_2 \\  t_k   &= k\Delta t  & & \mathrm{for} & k = 1 \ldots K  \end{aligned}

where \Delta\mu = \mu_F / n_1, \Delta\nu = \nu_F / n_2 and \Delta t = T / K.

Denoting p(\mu_i, t_k) = p_i^k and similarly we obtain

\displaystyle   \begin{aligned}  \frac{p_i^{k+1} - p_i^k}{\Delta t} + g^k\frac{p_i^{k+1} - p_{i-1}^{k+1}}{\Delta\mu} &= \sigma_i^k p_i^{k+1} \\  \frac{m_j^{k+1} - m_j^k}{\Delta t} + \phantom{g^k}\frac{m_j^{k+1} - m_{j-1}^{k+1}}{\Delta\nu} &= -\gamma_j^k m_j^{k+1}  \end{aligned}

and

\displaystyle   \begin{aligned}  \frac{E^{k+1} - E^k}{\Delta t} &= f^k - a_E^k E^{k+1} \\  g^k p_0^{k+1} &= \phi^k E^k \\  m_0^{k+1}     &= g^k p_{n_1}^{k+1}  \end{aligned}

Re-arranging we get

\displaystyle   \begin{aligned}  -g^k\frac{\Delta t}{\Delta \mu}p_{i-1}^{k+1} +  \bigg(1 + g^k\frac{\Delta t}{\Delta \mu} - \Delta t \sigma_i^k\bigg)p_i^{k+1} &=  p_i^k \\  -\frac{\Delta t}{\Delta \nu}m_{j-1}^{k+1} +  \bigg(1 + \frac{\Delta t}{\Delta \nu} + \Delta t \gamma_j^k\bigg)m_j^{k+1} &=  m_j^k  \end{aligned}

Writing

\displaystyle   \begin{aligned}  d_{1,i}^k &= 1 + g^k\frac{\Delta t}{\Delta \mu} - \Delta t \sigma_i^k \\  d_{2,i}^k &= 1 + \frac{\Delta t}{\Delta \nu} + \Delta t \gamma_i^k  \end{aligned}

We can express the above in matrix form

\displaystyle   \begin{bmatrix}  g^k & 0 & 0 & \ldots & 0 & 0 \\  -g^k\frac{\Delta t}{\Delta \mu} & d_{1,1}^k & 0 & \ldots & 0 & 0\\  0 & -g^k\frac{\Delta t}{\Delta \mu} & d_{1,2}^k & \ldots & 0 & 0 \\  \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\  0 & 0 & 0 & \ldots &\ -g^k\frac{\Delta t}{\Delta \mu} & d_{1,n_1}^k \\  \end{bmatrix}  \begin{bmatrix}  p_0^{k+1} \\  p_1^{k+1} \\  p_2^{k+1} \\  \ldots \\  p_{n_1}^{k+1}  \end{bmatrix}  =  \begin{bmatrix}  \phi^k E^k \\  p_1^k \\  p_2^k \\  \ldots \\  p_{n_1}^k \\  \end{bmatrix}

\displaystyle   \begin{bmatrix}  1 & 0 & 0 & \ldots & 0 & 0 \\  -\frac{\Delta t}{\Delta \nu} & d_{2,1}^k & 0 & \ldots & 0 & 0\\  0 & -\frac{\Delta t}{\Delta \nu} & d_{2,2}^k & \ldots & 0 & 0 \\  \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\  0 & 0 & 0 & \ldots &\ -\frac{\Delta t}{\Delta \nu} & d_{2,n_2}^k \\  \end{bmatrix}  \begin{bmatrix}  m_0^{k+1} \\  m_1^{k+1} \\  m_2^{k+1} \\  \ldots \\  m_{n_2}^{k+1}  \end{bmatrix}  =  \begin{bmatrix}  g^k p_{n_1}^{k+1} \\  m_1^k \\  m_2^k \\  \ldots \\  m_{n_2}^k \\  \end{bmatrix}

Finally we can write

\displaystyle   E^{k+1} = \frac{E^k + \Delta t f^k}{1 + a_E^k\Delta t}

A Haskell Implementation

> {-# OPTIONS_GHC -Wall #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE NoImplicitPrelude #-}
> {-# LANGUAGE FlexibleContexts #-}
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE TypeOperators #-}
> module Tribbles where
> import qualified Prelude as P
> import Numeric.Units.Dimensional.Prelude hiding (Unit)
> import Numeric.Units.Dimensional
> import Numeric.LinearAlgebra
> import Numeric.Integration.TanhSinh
> import Control.Monad.Writer
> import Control.Monad.Loops

Substances like erythropoietin (EPO) are measured in International Units and these cannot be converted to Moles (see Jelkmann (2009) for much more detail) so we have to pretend it really is measured in Moles as there seems to be no easy way to define what the dimensional package calls a base dimension. A typical amount for a person is 15 milli-IU / milli-litre but can reach much higher levels after loss of blood.

> muPerMl :: (Fractional a, Num a) => Unit 'NonMetric DConcentration a
> muPerMl = (milli mole) / (milli litre)
> bigE'0 :: Concentration Double
> bigE'0 = 15.0 *~ muPerMl

Let’s set up our grid. We take these from Ackleh et al. (2006) but note that J. J. Thibodeaux and Schlittenhardt (2011) seem to have T = 20.

> deltaT, deltaMu, deltaNu :: Time Double
> deltaT = 0.05 *~ day
> deltaMu = 0.01 *~ day
> deltaNu = 0.05 *~ day
> bigT :: Time Double
> bigT = 100.0 *~ day
> muF, nuF :: Time Double
> muF = 5.9 *~ day
> nuF = 120.0 *~ day
> bigK :: Int
> bigK = floor (bigT / deltaT /~ one)
> n1 :: Int
> n1 = floor (muF / deltaMu /~ one)
> n2 :: Int
> n2 = floor (nuF / deltaNu /~ one)
> ts :: [Time Double]
> ts = take bigK $ 0.0 *~ day : (map (+ deltaT) ts)

The birth rate for precursors

> betaThibodeaux :: Time Double ->
>                   Frequency Double
> betaThibodeaux mu
>   | mu < (0 *~ day) = error "betaThibodeaux: negative age"
>   | mu < (3 *~ day) = (2.773 *~ (one / day))
>   | otherwise       = (0.0 *~ (one /day))
> alphaThibodeaux :: Concentration Double ->
>                    Frequency Double
> alphaThibodeaux e = (0.5 *~ (muPerMl / day)) / ((1 *~ muPerMl) + e)
> sigmaThibodeaux :: Time Double ->
>                    Time Double ->
>                    Concentration Double ->
>                    Frequency Double
> sigmaThibodeaux mu _t e = gThibodeaux e * (betaThibodeaux mu - alphaThibodeaux e)

and an alternative birth rate

> betaAckleh :: Time Double -> Frequency Double
> betaAckleh mu
>   | mu < (0 *~ day) = error "betaAckleh: negative age"
>   | mu < (3 *~ day) = 2.773 *~ (one / day)
>   | otherwise       = 0.000 *~ (one / day)
> sigmaAckleh :: Time Double ->
>                Time Double ->
>                Concentration Double ->
>                Frequency Double
> sigmaAckleh mu _t e = betaAckleh mu * gAckleh e

J. J. Thibodeaux and Schlittenhardt (2011) give the maturation rate of precursors g as

> gThibodeaux :: Concentration Double  -> Dimensionless Double
> gThibodeaux e = d / n
>   where
>     n = ((3.02 *~ one) * e + (0.31 *~ muPerMl))
>     d = (30.61 *~ muPerMl) + e

and Ackleh et al. (2006) give this as

> gAckleh :: Concentration Double -> Dimensionless Double
> gAckleh _e = 1.0 *~ one

As in J. J. Thibodeaux and Schlittenhardt (2011) we give quantities in terms of cells per kilogram of body weight. Note that these really are moles on this occasion.

> type CellDensity = Quantity (DAmountOfSubstance / DTime / DMass)

Let’s set the initial conditions.

> p'0 :: Time Double -> CellDensity Double
> p'0 mu' = (1e11 *~ one) * pAux mu'
>   where
>     pAux mu
>       | mu < (0 *~ day) = error "p'0: negative age"
>       | mu < (3 *~ day) = 8.55e-6 *~ (mole / day / kilo gram) *
>                           exp ((2.7519 *~ (one / day)) * mu)
>       | otherwise       = 8.55e-6 *~ (mole / day / kilo gram) *
>                           exp (8.319 *~ one - (0.0211 *~ (one / day)) * mu)
> m_0 :: Time Double -> CellDensity Double
> m_0 nu' = (1e11 *~ one) * mAux nu'
>   where
>     mAux nu
>       | nu < (0 *~ day) = error "m_0: age less than zero"
>       | otherwise       = 0.039827  *~ (mole / day / kilo gram) *
>                           exp (((-0.0083) *~ (one / day)) * nu)

And check that these give plausible results.

> m_0Untyped :: Double -> Double
> m_0Untyped nu = m_0 (nu *~ day) /~ (mole / day / kilo gram)
> p'0Untyped :: Double -> Double
> p'0Untyped mu = p'0 (mu *~ day) /~ (mole / day / kilo gram)
ghci> import Numeric.Integration.TanhSinh
ghci> result $ relative 1e-6 $ parTrap m_0Untyped 0.001 (nuF /~ day)
  3.0260736659043414e11

ghci> result $ relative 1e-6 $ parTrap p'0Untyped 0.001 (muF /~ day)
  1.0453999900927126e10

We can now create the components for the first matrix equation.

> g'0 :: Dimensionless Double
> g'0 = gThibodeaux bigE'0
> d_1'0 :: Int -> Dimensionless Double
> d_1'0 i = (1 *~ one) + (g'0 * deltaT / deltaMu)
>           - deltaT * sigmaThibodeaux ((fromIntegral i *~ one) * deltaMu) undefined bigE'0
> lowers :: [Dimensionless Double]
> lowers = replicate n1 (negate $ g'0 * deltaT / deltaMu)
> diags :: [Dimensionless Double]
> diags = g'0 : map d_1'0 [1..n1]
> uppers :: [Dimensionless Double]
> uppers = replicate n1 (0.0 *~ one)

J. J. Thibodeaux and Schlittenhardt (2011) does not give a definition for \phi so we use the equivalent s_0 from Ackleh et al. (2006) which references Banks et al. (2003): “\times 10^{11} erythrocytes/kg body weight \times mL plasma/mU Epo/day”

> s_0 :: Time Double ->
>        Quantity (DAmountOfSubstance / DTime / DMass / DConcentration) Double
> s_0 = const ((1e11 *~ one) * (4.45e-7 *~ (mole / day / kilo gram / muPerMl)))
> b'0 :: [CellDensity Double]
> b'0 = (s_0 (0.0 *~ day) * bigE'0) : (take n1 $ map p'0 (iterate (+ deltaMu) deltaMu))

With these components in place we can now solve the implicit scheme and get the age distribution of precursors after one time step.

> p'1 :: Matrix Double
> p'1 = triDiagSolve (fromList (map (/~ one) lowers))
>                    (fromList (map (/~ one) diags))
>                    (fromList (map (/~ one) uppers))
>                    (((n1 P.+1 )><1) (map (/~ (mole / second / kilo gram)) b'0))

In order to create the components for the second matrix equation, we need the death rates of mature erythrocytes

> gammaThibodeaux :: Time Double ->
>                    Time Double ->
>                    Quantity (DAmountOfSubstance / DMass) Double ->
>                    Frequency Double
> gammaThibodeaux _nu _t _bigM = 0.0083 *~ (one / day)

We note an alternative for the death rate

> gammaAckleh :: Time Double ->
>                Time Double ->
>                Quantity (DAmountOfSubstance / DMass) Double ->
>                Frequency Double
> gammaAckleh _nu _t bigM = (0.01 *~ (kilo gram / mole / day)) * bigM + 0.0001 *~ (one / day)

For the initial mature erythrocyte population we can either use the integral of the initial distribution

> bigM'0 :: Quantity (DAmountOfSubstance / DMass) Double
> bigM'0 = r *~ (mole / kilo gram)
>  where
>    r = result $ relative 1e-6 $ parTrap m_0Untyped 0.001 (nuF /~ day)
ghci> bigM'0
  3.0260736659043414e11 kg^-1 mol

or we can use the sum of the values used in the finite difference approximation

> bigM'0' :: Quantity (DAmountOfSubstance / DMass) Double
> bigM'0' = (* deltaNu) $ sum $ map m_0 $ take n2 $ iterate (+ deltaNu) (0.0 *~ day)
ghci> bigM'0'
  3.026741454719976e11 kg^-1 mol

Finally we can create the components

> d_2'0 :: Int -> Dimensionless Double
> d_2'0 i = (1 *~ one) + (g'0 * deltaT / deltaNu)
>           + deltaT * gammaThibodeaux ((fromIntegral i *~ one) * deltaNu) undefined bigM'0
> lowers2 :: [Dimensionless Double]
> lowers2 = replicate n2 (negate $ deltaT / deltaNu)
> diags2 :: [Dimensionless Double]
> diags2 = (1.0 *~ one) : map d_2'0 [1..n2]
> uppers2 :: [Dimensionless Double]
> uppers2 = replicate n2 (0.0 *~ one)
> b_2'0 :: [CellDensity Double]
> b_2'0 = (g'0 * ((p'1 `atIndex` (n1,0)) *~ (mole / second / kilo gram))) :
>         (take n2 $ map m_0 (iterate (+ deltaNu) deltaNu))

and then solve the implicit scheme to get the distribution of mature erythrocytes one time step ahead

> m'1 :: Matrix Double
> m'1 = triDiagSolve (fromList (map (/~ one) lowers2))
>                    (fromList (map (/~ one) diags2))
>                    (fromList (map (/~ one) uppers2))
>                    (((n2 P.+ 1)><1) (map (/~ (mole / second / kilo gram)) b_2'0))

We need to complete the homeostatic loop by implementing the feedback from the kidneys to the bone marrow. Ackleh and Thibodeaux (2013) and Ackleh et al. (2006) give f as

> fAckleh :: Time Double ->
>            Quantity (DAmountOfSubstance / DMass) Double ->
>            Quantity (DConcentration / DTime) Double
> fAckleh _t bigM = a / ((1.0 *~ one) + k * (bigM' ** r))
>   where
>     a = 15600 *~ (muPerMl / day)
>     k = 0.0382 *~ one
>     r = 6.96 *~ one
>     bigM' = ((bigM /~ (mole / kilo gram)) *~ one) * (1e-11 *~ one)

The much older Belair, Mackey, and Mahaffy (1995) gives f as

> fBelair :: Time Double ->
>            Quantity (DAmountOfSubstance / DMass) Double ->
>            Quantity (DConcentration / DTime) Double
> fBelair _t bigM = a / ((1.0 *~ one) + k * (bigM' ** r))
>   where
>     a = 6570 *~ (muPerMl / day)
>     k = 0.0382 *~ one
>     r = 6.96 *~ one
>     bigM' = ((bigM /~ (mole / kilo gram)) *~ one) * (1e-11 *~ one)

For the initial precursor population we can either use the integral of the initial distribution

> bigP'0 :: Quantity (DAmountOfSubstance / DMass) Double
> bigP'0 = r *~ (mole / kilo gram)
>  where
>    r = result $ relative 1e-6 $ parTrap p'0Untyped 0.001 (muF /~ day)
ghci> bigP'0
  1.0453999900927126e10 kg^-1 mol

or we can use the sum of the values used in the finite difference approximation

> bigP'0' :: Quantity (DAmountOfSubstance / DMass) Double
> bigP'0' = (* deltaMu) $ sum $ map p'0 $ take n1 $ iterate (+ deltaMu) (0.0 *~ day)
ghci> bigP'0'
  1.0438999930030743e10 kg^-1 mol

J. J. Thibodeaux and Schlittenhardt (2011) give the following for a_E

> a_E :: Quantity (DAmountOfSubstance / DMass) Double -> Frequency Double
> a_E bigP = ((n / d) /~ one) *~ (one / day)
>   where
>     n :: Dimensionless Double
>     n = bigP * (13.8 *~ (kilo gram / mole)) + 0.04 *~ one
>     d :: Dimensionless Double
>     d = (bigP /~ (mole / kilo gram)) *~ one + 0.08 *~ one

but from Ackleh et al. (2006)

The only biological basis for the latter is that the decay rate of erythropoietin should be an increasing function of the precursor population and this function remains in the range 0.50–6.65 \mathrm{days}^{-1}

and, given this is at variance with their given function, it may be safer to use their alternative of

> a_E' :: Quantity (DAmountOfSubstance / DMass) Double -> Frequency Double
> a_E' _bigP = 6.65 *~ (one / day)

We now further calculate the concentration of EPO one time step ahead.

> f'0 :: Quantity (DConcentration / DTime) Double
> f'0 = fAckleh undefined bigM'0
> bigE'1 :: Concentration Double
> bigE'1 = (bigE'0 + deltaT * f'0) / (1.0 *~ one + deltaT * a_E' bigP'0)

Having done this for one time step starting at t=0, it’s easy to generalize this to an arbitrary time step.

> d_1 :: Dimensionless Double ->
>        Concentration Double ->
>        Int ->
>        Dimensionless Double
> d_1 g e i = (1 *~ one) + (g * deltaT / deltaMu)
>           - deltaT * sigmaThibodeaux ((fromIntegral i *~ one) * deltaMu) undefined e
> d_2 :: Quantity (DAmountOfSubstance / DMass) Double ->
>        Int ->
>        Dimensionless Double
> d_2 bigM i = (1 *~ one) + deltaT / deltaNu
>            + deltaT * gammaThibodeaux ((fromIntegral i *~ one) * deltaNu) undefined bigM
> oneStepM :: (Matrix Double, Matrix Double, Concentration Double, Time Double) ->
>             Writer [(Quantity (DAmountOfSubstance / DMass) Double,
>                      Quantity (DAmountOfSubstance / DMass) Double,
>                      Concentration Double)]
>                    (Matrix Double, Matrix Double, Concentration Double, Time Double)
> oneStepM (psPrev, msPrev, ePrev, tPrev) = do
>   let
>     g  = gThibodeaux ePrev
>     ls = replicate n1 (negate $ g * deltaT / deltaMu)
>     ds = g : map (d_1 g ePrev)  [1..n1]
>     us = replicate n1 (0.0 *~ one)
>     b1'0 = (s_0 tPrev * ePrev) /~ (mole / second / kilo gram)
>     b1 = asColumn $ vjoin [scalar b1'0, subVector 1 n1 $ flatten psPrev]
>     psNew :: Matrix Double
>     psNew = triDiagSolve (fromList (map (/~ one) ls))
>                          (fromList (map (/~ one) ds))
>                          (fromList (map (/~ one) us))
>                          b1
>     ls2 = replicate n2 (negate $ deltaT / deltaNu)
>     bigM :: Quantity (DAmountOfSubstance / DMass) Double
>     bigM = (* deltaNu) $ ((sumElements msPrev) *~ (mole / kilo gram / second))
>     ds2 = (1.0 *~ one) : map (d_2 bigM) [1..n2]
>     us2 = replicate n2 (0.0 *~ one)
>     b2'0 = (g * (psNew `atIndex` (n1, 0) *~ (mole / second / kilo gram))) /~
>            (mole / second / kilo gram)
>     b2 = asColumn $ vjoin [scalar b2'0, subVector 1 n2 $ flatten msPrev]
>     msNew :: Matrix Double
>     msNew = triDiagSolve (fromList (map (/~ one) ls2))
>                          (fromList (map (/~ one) ds2))
>                          (fromList (map (/~ one) us2))
>                          b2
>     bigP :: Quantity (DAmountOfSubstance / DMass) Double
>     bigP = (* deltaMu) $ sumElements psPrev *~ (mole / kilo gram / second)
>     f :: Quantity (DConcentration / DTime) Double
>     f = fAckleh undefined bigM
>     eNew :: Concentration Double
>     eNew = (ePrev + deltaT * f) / (1.0 *~ one + deltaT * a_E' bigP)
>     tNew = tPrev + deltaT
>   tell [(bigP, bigM, ePrev)]
>   return (psNew, msNew, eNew, tNew)

We can now run the model for 100 days.

> ys :: [(Quantity (DAmountOfSubstance / DMass) Double,
>         Quantity (DAmountOfSubstance / DMass) Double,
>         Concentration Double)]
> ys = take 2000 $
>      snd $
>      runWriter $
>      iterateM_ oneStepM ((((n1 P.+1 )><1) (map (/~ (mole / second / kilo gram)) b'0)),
>                          (((n2 P.+ 1)><1) $ (map (/~ (mole / second / kilo gram)) b_2'0)),
>                          bigE'0,
>                          (0.0 *~ day))

And now we can plot what happens for a period of 100 days.
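
The plotting code itself is not included here; below is a minimal sketch of my own (not the original plotting code) which simply strips the units and prints the trajectory so it can be charted with any external tool. It assumes the ts, ys and muPerMl defined above; P is the qualified standard Prelude imported at the top of the module.

-- One row per time step: (day, total precursors, total erythrocytes, EPO),
-- with the units stripped via /~ for easy export.
dumpResults :: IO ()
dumpResults = P.mapM_ P.print
  [ (t /~ day, p /~ (mole / kilo gram), m /~ (mole / kilo gram), e /~ muPerMl)
  | (t, (p, m, e)) <- P.zip ts ys
  ]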

References

Ackleh, Azmy S., and Jeremy J. Thibodeaux. 2013. “A second-order finite difference approximation for a mathematical model of erythropoiesis.” Numerical Methods for Partial Differential Equations, no. November: n/a–n/a. doi:10.1002/num.21778.

Ackleh, Azmy S., Keng Deng, Kazufumi Ito, and Jeremy Thibodeaux. 2006. “A Structured Erythropoiesis Model with Nonlinear Cell Maturation Velocity and Hormone Decay Rate.” Mathematical Biosciences 204 (1): 21–48. doi:http://dx.doi.org/10.1016/j.mbs.2006.08.004.

Banks, H T, Cammey E Cole, Paul M Schlosser, and Hien T Tran. 2003. “Modeling and Optimal Regulation of Erythropoiesis Subject to Benzene Intoxication.” https://www.ncsu.edu/crsc/reports/ftp/pdf/crsc-tr03-49.pdf.

Belair, Jacques, Michael C. Mackey, and Joseph M. Mahaffy. 1995. “Age-Structured and Two-Delay Models for Erythropoiesis.” Mathematical Biosciences 128 (1): 317–46. doi:http://dx.doi.org/10.1016/0025-5564(94)00078-E.

Gurtin, Morton E, and Richard C MacCamy. 1974. “Non-Linear Age-Dependent Population Dynamics.” Archive for Rational Mechanics and Analysis 54 (3). Springer: 281–300.

Hall, Adam J, Michael J Chappell, John AD Aston, and Stephen A Ward. 2013. “Pharmacokinetic Modelling of the Anti-Malarial Drug Artesunate and Its Active Metabolite Dihydroartemisinin.” Computer Methods and Programs in Biomedicine 112 (1). Elsevier: 1–15.

Jelkmann, Wolfgang. 2009. “Efficacy of Recombinant Erythropoietins: Is There Unity of International Units?” Nephrology Dialysis Transplantation 24 (5): 1366. doi:10.1093/ndt/gfp058.

Sawyer, Stephen T, SB Krantz, and E Goldwasser. 1987. “Binding and Receptor-Mediated Endocytosis of Erythropoietin in Friend Virus-Infected Erythroid Cells.” Journal of Biological Chemistry 262 (12). ASBMB: 5554–62.

Thibodeaux, Jeremy J., and Timothy P. Schlittenhardt. 2011. “Optimal Treatment Strategies for Malaria Infection.” Bulletin of Mathematical Biology 73 (11): 2791–2808. doi:10.1007/s11538-011-9650-8.

Torbett, Bruce E., and Jeffrey S. Friedman. 2009. “Erythropoiesis: An Overview.” In Erythropoietins, Erythropoietic Factors, and Erythropoiesis: Molecular, Cellular, Preclinical, and Clinical Biology, edited by Steven G. Elliott, Mary Ann Foote, and Graham Molineux, 3–18. Basel: Birkhäuser Basel. doi:10.1007/978-3-7643-8698-6_1.

Warming up for NUTS (No U-Turn)

I have been thinking about writing a blog on why the no u-turn sampler (NUTS) works rather than describing the actual algorithm. This led me to look at Jared Tobin’s Haskell implementation. His example tries to explore the Himmelblau function but only finds one local minimum. This is not unexpected; as the excellent Stan manual notes

Being able to carry out such invariant inferences in practice is an altogether different matter. It is almost always intractable to find even a single posterior mode, much less balance the exploration of the neighborhoods of multiple local maxima according to the probability masses.

and

For HMC and NUTS, the problem is that the sampler gets stuck in one of the two "bowls" around the modes and cannot gather enough energy from random momentum assignment to move from one mode to another.

rm(list = ls(all.names=TRUE))
unlink(".RData")

rstan::stan_version()
## [1] "2.12.0"
library(rstan)
rstan_options(auto_write = TRUE)

On the Rosenbrock function it fares much better.

knitr::include_graphics("RosenbrockA.png")


We can’t (at least I don’t know how to) try Stan out on Rosenbrock as it’s not a distribution but we can try it out on another nasty problem: the funnel. Some of this is taken directly from the Stan manual.

Here’s the Stan:

parameters {
  real y;
  vector[9] x;
}
model {
  y ~ normal(0,3);
  x ~ normal(0,exp(y/2));
}

which we can run with the following R:

funnel_fit <- stan(file='funnel.stan', cores=4, iter=10000)
## Warning: There were 92 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
funnel_samples <- extract(funnel_fit,permuted=TRUE,inc_warmup=FALSE);
funnel_df <- data.frame(x1=funnel_samples$x[,1],y=funnel_samples$y[])

Plotting the data requires some unpleasantness but shows the neck of the funnel does not get explored. So even HMC and NUTS do not perform well.

midpoints <- function(x, dp=2){
    lower <- as.numeric(gsub(",.*","",gsub("\\(|\\[|\\)|\\]","", x)))
    upper <- as.numeric(gsub(".*,","",gsub("\\(|\\[|\\)|\\]","", x)))
    return(round(lower+(upper-lower)/2, dp))
}

df <- funnel_df[funnel_df$x1 < 20 & funnel_df$x1 > -20 & funnel_df$y < 9 & funnel_df$y > -9,]
x_c <- cut(df$x1, 20)
y_c <- cut(df$y, 20)
z <- table(x_c, y_c)
z_df <- as.data.frame(z)
a_df <- data.frame(x=midpoints(z_df$x_c),y=midpoints(z_df$y_c),f=z_df$Freq)

m = as.matrix(dcast(a_df,x ~ y))
## Using f as value column: use value.var to override.
hist3D(x=m[,"x"],y=as.double(colnames(m)[2:21]),z=(as.matrix(dcast(a_df,x ~ y)))[,2:21], border="black",ticktype = "detailed",theta=5,phi=20)
## Using f as value column: use value.var to override.

Since the analytic form of the distribution is known, one can apply a trick to correct this problem and then one is sampling from unit normals!
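
Written out (my gloss, not from the manual), the change of variables is

\displaystyle   y = 3 y_{\mathrm{raw}}, \qquad x_i = e^{y/2}\, x_{\mathrm{raw},i}, \qquad y_{\mathrm{raw}} \sim \mathcal{N}(0,1), \quad x_{\mathrm{raw},i} \sim \mathcal{N}(0,1)

so that y \sim \mathcal{N}(0,3) and x_i \,|\, y \sim \mathcal{N}(0, e^{y/2}) exactly as before, but the sampler itself only ever sees independent unit normals.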

parameters {
  real y_raw;
  vector[9] x_raw;
}
transformed parameters {
  real y;
  vector[9] x;

  y <- 3.0 * y_raw;
  x <- exp(y/2) * x_raw;
}
model {
  y_raw ~ normal(0,1);
  x_raw ~ normal(0,1);
}

And now the neck of the funnel is explored.

funnel_fit <- stan(file='funnel_reparam.stan', cores=4, iter=10000)
funnel_samples <- extract(funnel_fit,permuted=TRUE,inc_warmup=FALSE);
funnel_df <- data.frame(x1=funnel_samples$x[,1],y=funnel_samples$y[])

df <- funnel_df[funnel_df$x1 < 20 & funnel_df$x1 > -20 & funnel_df$y < 9 & funnel_df$y > -9,]
x_c <- cut(df$x1, 20)
y_c <- cut(df$y, 20)
z <- table(x_c, y_c)
z_df <- as.data.frame(z)
a_df <- data.frame(x=midpoints(z_df$x_c),y=midpoints(z_df$y_c),f=z_df$Freq)

m = as.matrix(dcast(a_df,x ~ y))
## Using f as value column: use value.var to override.
hist3D(x=m[,"x"],y=as.double(colnames(m)[2:21]),z=(as.matrix(dcast(a_df,x ~ y)))[,2:21], border="black",ticktype = "detailed",theta=5,phi=20)
## Using f as value column: use value.var to override.

We’d expect the Haskell implementation to also fail to explore the neck. Maybe I will return to this after the article on why NUTS works.
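
For anyone who wants to try, here is a minimal sketch of my own (not Jared Tobin’s code) of the funnel’s log density in Haskell, which is the function such a sampler would need to be given:

-- Log density of Neal's funnel: y ~ N(0, 3) and x_i | y ~ N(0, exp(y/2)).
-- The first element of the parameter vector is y, the rest are the x_i.
logFunnel :: [Double] -> Double
logFunnel []       = error "logFunnel: empty parameter vector"
logFunnel (y : xs) = logNormal 0 3 y + sum (map (logNormal 0 (exp (y / 2))) xs)
  where
    logNormal mu sigma z =
      negate (log (sigma * sqrt (2 * pi))) - (z - mu) ^ (2 :: Int) / (2 * sigma ^ (2 :: Int))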

Calling Haskell from C

As part of improving the random number generation story for Haskell, I want to be able to use the testu01 library with the minimal amount of Haskell wrapping. testu01 assumes that there is a C function which returns the random number. The GHC manual gives an example but does not give all the specifics. These are my notes on how to get the example working under OS X (El Capitan 10.11.5 to be precise).

The Haskell:

{-# OPTIONS_GHC -Wall                 #-}

{-# LANGUAGE ForeignFunctionInterface #-}

module Foo where

foreign export ccall foo :: Int -> IO Int

foo :: Int -> IO Int
foo n = return (length (f n))

f :: Int -> [Int]
f 0 = []
f n = n:(f (n-1))

The .cabal:

name:                test-via-c
version:             0.1.0.0
homepage:            TBD
license:             MIT
author:              Dominic Steinitz
maintainer:          idontgetoutmuch@gmail.com
category:            System
build-type:          Simple
cabal-version:       >=1.10

executable Foo.dylib
  main-is: Foo.hs
  other-extensions:    ForeignFunctionInterface
  build-depends:       base >=4.7 && =0.6 && <0.7
  hs-source-dirs:      src
  default-language:    Haskell2010
  include-dirs:        src
  ghc-options:         -O2 -shared -fPIC -dynamic
  extra-libraries:     HSrts-ghc8.0.1

On my computer running

cabal install

places the library in

~/Library/Haskell/ghc-8.0.1/lib/test-via-c-0.1.0.0/bin

The C:

#include <stdio.h>
#include "HsFFI.h"

#include "../dist/build/Foo.dylib/Foo.dylib-tmp/Foo_stub.h"

int main(int argc, char *argv[])
{
  int i;

  hs_init(&argc, &argv);

  for (i = 0; i < 5; i++) {
    printf("%d\n", foo(2500));
  }

  hs_exit();
  return 0;
}

On my computer this can be compiled with

gcc-6 Bar.c
~/Library/Haskell/ghc-8.0.1/lib/test-via-c-0.1.0.0/bin/Foo.dylib
-I/Library/Frameworks/GHC.framework/Versions/8.0.1-x86_64/usr/lib/ghc-8.0.1/include
-L/Library/Frameworks/GHC.framework/Versions/8.0.1-x86_64/usr/lib/ghc-8.0.1/rts
-lHSrts-ghc8.0.1

and can be run with

DYLD_LIBRARY_PATH=
~/Library/Haskell/ghc-8.0.1/lib/test-via-c-0.1.0.0/bin:
/Library/Frameworks/GHC.framework/Versions/8.0.1-x86_64/usr/lib/ghc-8.0.1/rts

N.B. setting DYLD_LIBRARY_PATH like this is not recommended as it is a good way of breaking things. I have tried setting DYLD_FALLBACK_LIBRARY_PATH but only to get an error message. Hopefully, at some point I will be able to post a robust way of getting the executable to pick up the required dynamic libraries.
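
Looking ahead to the testu01 goal, the exported function will eventually need to hand back a word of random bits per call rather than a list length. Purely as an illustration of the foreign export pattern, here is a sketch of my own (with a toy linear congruential generator standing in for a real one; it is not the wrapper I intend to use):

{-# LANGUAGE ForeignFunctionInterface #-}

module RandExport where

import Data.IORef
import Data.Word
import System.IO.Unsafe (unsafePerformIO)

-- A C driver can call this repeatedly to obtain 32 random bits per call.
foreign export ccall nextBits :: IO Word32

-- Toy generator state; Word32 arithmetic wraps modulo 2^32.
genState :: IORef Word32
genState = unsafePerformIO (newIORef 42)
{-# NOINLINE genState #-}

nextBits :: IO Word32
nextBits = atomicModifyIORef' genState step
  where
    step s = let s' = 1664525 * s + 1013904223 in (s', s')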

UK / South Korea Trade: A Bayesian Analysis

Introduction

I was intrigued by a tweet by the UK Chancellor of the Exchequer stating “exports [to South Korea] have doubled over the last year. Now worth nearly £11bn” and a tweet by a Member of the UK Parliament stating South Korea is “our second fastest growing trading partner”. Although I have never paid much attention to trade statistics, both these statements seemed surprising. But these days it’s easy enough to verify such statements. It’s also an opportunity to use the techniques I believe data scientists in (computer) game companies use to determine how much impact a new feature has on the game’s consumers.

One has to be slightly careful with trade statistics as they come in many different forms, e.g., just goods or goods and services etc. When I provide software and analyses to US organisations, I am included in the services exports from the UK to the US.

Let’s analyse goods first before moving on to goods and services.

Getting the Data

First let’s get hold of the quarterly data from the UK Office of National Statistics.

ukstats <- "https://www.ons.gov.uk"
bop <- "economy/nationalaccounts/balanceofpayments"
ds <- "datasets/tradeingoodsmretsallbopeu2013timeseriesspreadsheet/current/mret.csv"

mycsv <- read.csv(paste(ukstats,"file?uri=",bop,ds,sep="/"),stringsAsFactors=FALSE)

Now we can find the columns that refer to Korea.

ns <- which(grepl("Korea", names(mycsv)))
length(ns)
## [1] 3
names(mycsv[ns[1]])
## [1] "BoP.consistent..South.Korea..Exports..SA................................"
names(mycsv[ns[2]])
## [1] "BoP.consistent..South.Korea..Imports..SA................................"
names(mycsv[ns[3]])
## [1] "BoP.consistent..South.Korea..Balance..SA................................"

Now we can pull out the relevant information and create a data frame of it.

korean <- mycsv[grepl("Korea", names(mycsv))]
imports <- korean[grepl("Imports", names(korean))]
exports <- korean[grepl("Exports", names(korean))]
balance <- korean[grepl("Balance", names(korean))]

df <- data.frame(mycsv[grepl("Title", names(mycsv))],
                 imports,
                 exports,
                 balance)
colnames(df) <- c("Title", "Imports", "Exports", "Balance")

startQ <- which(grepl("1998 Q1",df$Title))
endQ <- which(grepl("2016 Q3",df$Title))
dfQ <- df[startQ:endQ,]

We can now plot the data.

tab <- data.frame(kr=as.numeric(dfQ$Exports),
                  krLabs=as.numeric(as.Date(as.yearqtr(dfQ$Title,format='%Y Q%q'))))

ggplot(tab, aes(x=as.Date(tab$krLabs), y=tab$kr)) + geom_line() +
    theme(legend.position="bottom") +
    ggtitle("Goods Exports UK / South Korea (Quarterly)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    xlab("Date") +
    ylab("Value (£m)")

For good measure let’s plot the annual data.

startY <- grep("^1998$",df$Title)
endY <- grep("^2015$",df$Title)
dfYear <- df[startY:endY,]

tabY <- data.frame(kr=as.numeric(dfYear$Exports),
                   krLabs=as.numeric(dfYear$Title))

ggplot(tabY, aes(x=tabY$krLabs, y=tabY$kr)) + geom_line() +
    theme(legend.position="bottom") +
    ggtitle("Goods Exports UK / South Korea (Annual)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    xlab("Date") +
    ylab("Value (£m)")

And the monthly data.

startM <- grep("1998 JAN",df$Title)
endM <- grep("2016 OCT",df$Title)
dfMonth <- df[startM:endM,]

tabM <- data.frame(kr=as.numeric(dfMonth$Exports),
                   krLabs=as.numeric(as.Date(as.yearmon(dfMonth$Title,format='%Y %B'))))

ggplot(tabM, aes(x=as.Date(tabM$krLabs), y=tabM$kr)) + geom_line() +
    theme(legend.position="bottom") +
    ggtitle("Goods Exports UK / South Korea (Monthly)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    xlab("Date") +
    ylab("Value (£m)")

It looks like some change took place in 2011 but nothing to suggest either that “exports have doubled over the last year” or that South Korea is “our second fastest growing trading partner”. That some sort of change did happen is further supported by the fact that a Free Trade Agreement between the EU and Korea was put in place in 2011.

But was there really a change? And what sort of change was it? Sometimes it’s easy to imagine patterns where there are none.

With this warning in mind let us see if we can get a better feel from the numbers as to what happened.

The Model

Let us assume that the data for exports are approximated by a linear function of time but that there is a change in the slope and the offset at some point during observation.

\begin{aligned} \tau &\sim {\mathrm{Uniform}}(1, N) \\ \mu_1 &\sim \mathcal{N}(\mu_{\mu_1}, \sigma_{\mu_1}) \\ \gamma_1 &\sim \mathcal{N}(\mu_{\gamma_1}, \sigma_{\gamma_1}) \\ \sigma_1 &\sim \mathcal{N}(\mu_{\sigma_1}, \sigma_{\sigma_1}) \\ \mu_2 &\sim \mathcal{N}(\mu_{\mu_2}, \sigma_{\mu_2}) \\ \gamma_2 &\sim \mathcal{N}(\mu_{\gamma_2}, \sigma_{\gamma_2}) \\ \sigma_2 &\sim \mathcal{N}(\mu_{\sigma_2}, \sigma_{\sigma_2}) \\ y_i &\sim \begin{cases} \mathcal{N}(\mu_1 x_i + \gamma_1, \sigma_1) & \mbox{if } i < \tau \\ \mathcal{N}(\mu_2 x_i + \gamma_2, \sigma_2), & \mbox{if } i  \geq \tau \end{cases} \end{aligned}

Since we are going to use stan to infer the parameters for this model and stan cannot handle discrete parameters, we need to marginalize out this (discrete) parameter. I hope to do the same analysis with LibBi which seems more suited to time series analysis and which I believe will not require such a step.

Setting D = \{y_i\}_{i=1}^N we can calculate the likelihood

\begin{aligned} p(D \,|\, \mu_1, \gamma_1, \sigma_1, \mu_2, \gamma_2, \sigma_2) &= \sum_{\tau=1}^N p(\tau, D \,|\, \mu_1, \gamma_1, \sigma_1, \mu_2, \gamma_2, \sigma_2) \\ &= \sum_{\tau=1}^N p(\tau) p(D \,|\, \tau, \mu_1, \gamma_1, \sigma_1, \mu_2, \gamma_2, \sigma_2) \\ &=\sum_{\tau=1}^N p(\tau) \prod_{i=1}^N p(y_i \,|\, \tau, \mu_1, \gamma_1, \sigma_1, \mu_2, \gamma_2, \sigma_2) \end{aligned}

stan operates on the log scale and thus requires the log likelihood

\log p(D \,|\, \mu_1, \gamma_1, \sigma_1, \mu_2, \gamma_2, \sigma_2) = \mathrm{log\_sum\_exp}_{\tau=1}^N \big(   \log \mathcal{U}(\tau \, | \, 1, N) \\   + \sum_{i=1}^N \log \mathcal{N}(y_i \, | \, \nu_i, \rho_i)   \big)

where

\begin{aligned}   \nu_i &=   \begin{cases}     \mu_1 x_i + \gamma_1 & \mbox{if } i < \tau \\     \mu_2 x_i + \gamma_2 & \mbox{if } i  \geq \tau   \end{cases} \\   \rho_i &=   \begin{cases}     \sigma_1 & \mbox{if } i < \tau \\     \sigma_2 & \mbox{if } i  \geq \tau   \end{cases} \end{aligned}

and where the log sum of exponents function is defined by

\mathrm{\log\_sum\_exp}_{n=1}^N \, \alpha_n = \log \sum_{n=1}^N \exp(\alpha_n).

The log sum of exponents function allows the model to be coded directly in Stan using the built-in function log_sum_exp, which provides both arithmetic stability and efficiency for mixture model calculations.

Stan

Here’s the model in stan. Sadly I haven’t found a good way of divvying up .stan files in a .Rmd file so that it still compiles.

data {
  int<lower=1> N;
  real x[N];
  real y[N];
}

parameters {
  real mu1;
  real mu2;
  real gamma1;
  real gamma2;

  real<lower=0> sigma1;
  real<lower=0> sigma2;
}

transformed parameters {
      vector[N] log_p;
      real mu;
      real sigma;
      log_p = rep_vector(-log(N), N);
      for (tau in 1:N)
        for (i in 1:N) {
          mu = i < tau ? (mu1 * x[i] + gamma1) : (mu2 * x[i] + gamma2);
          sigma = i < tau ? sigma1 : sigma2;
          log_p[tau] = log_p[tau] + normal_lpdf(y[i] | mu, sigma);
      }
}

model {
    mu1 ~ normal(0, 10);
    mu2 ~ normal(0, 10);
    gamma1 ~ normal(0, 10);
    gamma2 ~ normal(0, 10);
    sigma1 ~ normal(0, 10);
    sigma2 ~ normal(0, 10);

    target += log_sum_exp(log_p);
}

generated quantities {
    int<lower=1,upper=N> tau;
    tau = categorical_rng(softmax(log_p));
}

The above, although mimicking our mathematical model, has quadratic complexity and we can use the trick in the stan manual to make it linear albeit with less clarity.
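
The identity being exploited (spelled out here because the code is terse) is that for each candidate changepoint \tau

\displaystyle   \log p[\tau] = -\log N  + \sum_{i=1}^{\tau-1} \log \mathcal{N}(y_i \,|\, \mu_1 x_i + \gamma_1, \sigma_1)  + \sum_{i=\tau}^{N} \log \mathcal{N}(y_i \,|\, \mu_2 x_i + \gamma_2, \sigma_2)

and both sums can be read off from running prefix sums (log_p_e and log_p_l below) computed in a single pass over the data, giving \mathcal{O}(N) rather than \mathcal{O}(N^2) work.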

data {
  int<lower=1> N;
  real x[N];
  real y[N];
}

parameters {
  real mu1;
  real mu2;
  real gamma1;
  real gamma2;

  real<lower=0> sigma1;
  real<lower=0> sigma2;
}

transformed parameters {
  vector[N] log_p;
  {
    vector[N+1] log_p_e;
    vector[N+1] log_p_l;
    log_p_e[1] = 0;
    log_p_l[1] = 0;
    for (i in 1:N) {
      log_p_e[i + 1] = log_p_e[i] + normal_lpdf(y[i] | mu1 * x[i] + gamma1, sigma1);
      log_p_l[i + 1] = log_p_l[i] + normal_lpdf(y[i] | mu2 * x[i] + gamma2, sigma2);
    }
    log_p = rep_vector(-log(N) + log_p_l[N + 1], N) + head(log_p_e, N) - head(log_p_l, N);
  }
}

model {
    mu1 ~ normal(0, 10);
    mu2 ~ normal(0, 10);
    gamma1 ~ normal(0, 10);
    gamma2 ~ normal(0, 10);
    sigma1 ~ normal(0, 10);
    sigma2 ~ normal(0, 10);

    target += log_sum_exp(log_p);
}

generated quantities {
    int<lower=1,upper=N> tau;
    tau = categorical_rng(softmax(log_p));
}

Let’s run this model with the monthly data.

NM <- nrow(tabM)
KM <- ncol(tabM)

yM <- tabM$kr
XM <- data.frame(tabM,rep(1,NM))[,2:3]

fitM <- stan(
    file = "lr-changepoint-ng.stan",
    data = list(x = XM$krLabs, y = yM, N = length(yM)),
    chains = 4,
    warmup = 1000,
    iter = 10000,
    cores = 4,
    refresh = 500,
    seed=42
)
## Warning: There were 662 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems

Looking at the results below we see a multi-modal distribution so a mean is not of much use.

histData <- hist(extract(fitM)$tau,plot=FALSE,breaks=c(seq(1,length(yM),1)))
histData$counts
##   [1] 18000     0     0     0     0     0     0     0     0     0     0
##  [12]     0     0     0     0     0     0     0     0     0     0     0
##  [23]     0     0     0     0     0     0     0     0     0     0     0
##  [34]     0     0     0     0     0     0     0     0     0     0     0
##  [45]     0     0     0     0     0     0     0     0     0     0     0
##  [56]     0     0     0     0     0     0     0     0     0     0     0
##  [67]     0     0     0     0     0     0     0     0     0     0     0
##  [78]     0     0     0     0     0     0     0     0     0     0     0
##  [89]     0     0     0     0     0     0     0     0     0     0     0
## [100]     0     0     0     0     0     0     0     0     0     0     0
## [111]     0     0     0     0     0     0     0     1     4    12    16
## [122]    16   107   712  8132     0     0     0     0     0     0     0
## [133]     0     0     0     0     0     0     0     0     0     0     0
## [144]     0     0     0     0     0     0     0     0     0     0     0
## [155]     0     0     0     0     0     0     0     0     0     0    25
## [166]   171  2812     0     0     0     0     0     0     0     0     0
## [177]     0     0     0     0     0     0     0     0     0     0     0
## [188]     0     0     0     0     0     0     0     0     0     0     0
## [199]     0     0     0     0     0     0     0     0     0     0     0
## [210]     0     0     0     0     0     0     0     0     0     0     0
## [221]     0     0     0     0  5992

We can get a pictorial representation of the maxima so that the multi-modality is even clearer.

min_indexes = which(diff(  sign(diff( c(0,histData$counts,0)))) == 2)
max_indexes = which(diff(  sign(diff( c(0,histData$counts,0)))) == -2)
modeData = data.frame(x=1:length(histData$counts),y=histData$counts)
min_locs = modeData[min_indexes,]
max_locs = modeData[max_indexes,]
plot(modeData$y, type="l")
points( min_locs, col="red", pch=19, cex=1  )
points( max_locs, col="green", pch=19, cex=1  )

My interpretation is that the evidence (data) says there is probably no changepoint (a change at the beginning or end is no change) but there might be a change at intermediate data points.

We can see something strange (maybe a large single export?) happened at index 125 which translates to 2008MAY.

The mode at index 167 which translates to 2011NOV corresponds roughly to the EU / Korea trade agreement.

Let us assume that there really was a material difference in trade at this latter point. We can fit a linear regression before this point and one after this point.

Here’s the stan

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N,K]  X;
  vector[N]    y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(X * beta, sigma);
}

And here’s the R to fit the before and after data. We fit the model for each period and pull out the regression parameters.

N <- length(yM)
M <- max_locs$x[3]

fite <- stan(file = 'LR.stan',
             data = list(N = M, K = ncol(XM), y = yM[1:M], X = XM[1:M,]),
             pars=c("beta", "sigma"),
             chains=3,
             cores=3,
             iter=3000,
             warmup=1000,
             refresh=-1)

se <- extract(fite, pars = c("beta", "sigma"), permuted=TRUE)
estCovParamsE <- colMeans(se$beta)

fitl <- stan(file = 'LR.stan',
             data = list(N = N-M, K = ncol(XM), y = yM[(M+1):N], X = XM[(M+1):N,]),
             pars=c("beta", "sigma"),
             chains=3,
             cores=3,
             iter=3000,
             warmup=1000,
             refresh=-1)

sl <- extract(fitl, pars = c("beta", "sigma"), permuted=TRUE)
estCovParamsL <- colMeans(sl$beta)

We can then make predictions for both periods.

linRegPredsE <- data.matrix(XM) %*% estCovParamsE
linRegPredsL <- data.matrix(XM) %*% estCovParamsL
ggplot(tabM, aes(x=as.Date(tabM$krLabs), y=tabM$kr)) +
    geom_line(aes(x = as.Date(tabM$krLabs), y = tabM$kr, col = "Actual")) +
    geom_line(data=tabM[1:M,], aes(x = as.Date(tabM$krLabs[1:M]), y = linRegPredsE[(1:M),1], col = "Fit (Before FTA)")) +
    geom_line(data=tabM[(M+1):N,], aes(x = as.Date(tabM$krLabs[(M+1):N]), y = linRegPredsL[((M+1):N),1], col = "Fit (After FTA)")) +
    theme(legend.position="bottom") +
    ggtitle("Goods Exports UK / South Korea (Monthly)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    xlab("Date") +
    ylab("Value (£m)")

An Intermediate Conclusion and Goods and Services (Pink Book)

So we didn’t manage to substantiate either the Chancellor’s claim or the Member of Parliament’s claim.

But it may be that if we look at Goods and Services we can see the numbers behind the claims.

pb <- "datasets/pinkbook/current/pb.csv"
pbcsv <- read.csv(paste(ukstats,"file?uri=",bop,pb,sep="/"),stringsAsFactors=FALSE)

This has a lot more information albeit only annually.

pbns <- grep("Korea", names(pbcsv))
length(pbns)
## [1] 21
lapply(pbns,function(x) names(pbcsv[x]))
## [[1]]
## [1] "BoP..Current.Account..Goods...Services..Imports..South.Korea............"
## 
## [[2]]
## [1] "BoP..Current.Account..Current.Transfer..Balance..South.Korea............"
## 
## [[3]]
## [1] "BoP..Current.Account..Goods...Services..Balance..South.Korea............"
## 
## [[4]]
## [1] "IIP..Assets..Total.South.Korea.........................................."
## 
## [[5]]
## [1] "Trade.in.Services.replaces.1.A.B....Exports.Credits...South.Korea...nsa."
## 
## [[6]]
## [1] "IIP...Liabilities...Total...South.Korea................................."
## 
## [[7]]
## [1] "BoP..Total.income..Balance..South.Korea................................."
## 
## [[8]]
## [1] "BoP..Total.income..Debits..South.Korea.................................."
## 
## [[9]]
## [1] "BoP..Total.income..Credits..South.Korea................................."
## 
## [[10]]
## [1] "BoP..Current.account..Balance..South.Korea.............................."
## 
## [[11]]
## [1] "BoP..Current.account..Debits..South.Korea..............................."
## 
## [[12]]
## [1] "BoP..Current.account..Credits..South.Korea.............................."
## 
## [[13]]
## [1] "IIP...Net...Total....South.Korea........................................"
## 
## [[14]]
## [1] "Trade.in.Services.replaces.1.A.B....Imports.Debits...South.Korea...nsa.."
## 
## [[15]]
## [1] "BoP..Current.Account..Services..Total.Balance..South.Korea.............."
## 
## [[16]]
## [1] "Bop.consistent..Balance..NSA..South.Korea..............................."
## 
## [[17]]
## [1] "Bop.consistent..Im..NSA..South.Korea...................................."
## 
## [[18]]
## [1] "Bop.consistent..Ex..NSA..South.Korea...................................."
## 
## [[19]]
## [1] "Current.Transfers...Exports.Credits...South.Korea...nsa................."
## 
## [[20]]
## [1] "Current.Transfers...Imports.Debits...South.Korea...nsa.................."
## 
## [[21]]
## [1] "BoP..Current.Account..Goods...Services..Exports..South.Korea............"

Let’s just look at exports.

koreanpb <- pbcsv[grepl("Korea", names(pbcsv))]
exportspb <- koreanpb[grepl("Exports", names(koreanpb))]
names(exportspb)
## [1] "Trade.in.Services.replaces.1.A.B....Exports.Credits...South.Korea...nsa."
## [2] "Current.Transfers...Exports.Credits...South.Korea...nsa................."
## [3] "BoP..Current.Account..Goods...Services..Exports..South.Korea............"

The last column gives exports of Goods and Services so let’s draw a chart of it.

pb <- data.frame(pbcsv[grepl("Title", names(pbcsv))],
                 exportspb[3])
colnames(pb) <- c("Title", "Exports")

startpbY <- which(grepl("1999",pb$Title))
endpbY <- which(grepl("2015",pb$Title))
pbY <- pb[startpbY:endpbY,]

tabpbY <- data.frame(kr=as.numeric(pbY$Exports),
                     krLabs=as.numeric(pbY$Title))

ggplot(tabpbY, aes(x=tabpbY$krLabs, y=tabpbY$kr)) + geom_line() +
    theme(legend.position="bottom") +
    ggtitle("Goods and Services Exports UK / South Korea (Annual)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    xlab("Date") +
    ylab("Value (£m)")

No joy here either for any of the claims. Still, it’s been an interesting exercise.

Girsanov’s Theorem

Introduction

We previously used importance sampling in the case where we did not have a sampler available for the distribution from which we wished to sample. There is an even more compelling case for using importance sampling.

Suppose we wish to estimate the probability of a rare event. For example, suppose we wish to estimate \mathbb{P}(X > 5) where X \sim {\mathcal{N}}(0,1). In this case, we can look up the answer \mathbb{P}(X > 5) \approx 2.867\times 10^{-7}. But suppose we couldn’t look up the answer. One strategy that might occur to us is to sample and then estimate the probability by counting the number of times out of the total that the sample was bigger than 5. The flaw in this is obvious but let’s try it anyway.

> module Girsanov where
> import qualified Data.Vector as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.Number.Erf
> import Data.List ( transpose )
> samples :: (Foldable f, MonadRandom m) =>
>                     (Int -> RVar Double -> RVar (f Double)) ->
>                     Int ->
>                     m (f Double)
> samples repM n = sample $ repM n $ stdNormal
> biggerThan5 :: Int
> biggerThan5 = length (evalState xs (pureMT 42))
>   where
>     xs :: MonadRandom m => m [Double]
>     xs = liftM (filter (>= 5.0)) $ samples replicateM 100000

As we might have expected, even if we draw 100,000 samples, we estimate this probability quite poorly.

ghci> biggerThan5
  0

Using importance sampling we can do a lot better.

Let \xi be a normally distributed random variable with zero mean and unit variance on a probability space with measure \mathbb{P}. As usual we can then define a new probability measure, the law of \xi, by

\displaystyle   \begin{aligned}  \mathbb{P}_\xi((-\infty, b])  &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^b e^{-x^2/2}\,\mathrm{d}x  \end{aligned}

Thus

\displaystyle   \begin{aligned}  \mathbb{E}_\xi(f) &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty f(x) e^{-x^2/2}\,\mathrm{d}x \\  &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty f(x) e^{a^2/2}e^{-a x}e^{-(x-a)^2/2}\,\mathrm{d}x \\  &= \mathbb{E}_{\xi + a}(fg) \\  &= \mathbb{\tilde{E}}_{\xi + a}(f)  \end{aligned}

where we have defined

\displaystyle   g(x) \triangleq e^{a^2/2}e^{-a x}  \quad \mathrm{and} \quad  \mathbb{\tilde{P}}((-\infty, b]) \triangleq \frac{1}{\sqrt{2\pi}}\int_{-\infty}^b g(x)\,e^{-(x-a)^2/2}\,\mathrm{d}x
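
As a quick sanity check of my own, the weights do indeed have expectation one under the shifted measure:

\displaystyle   \mathbb{E}_{\xi + a}(g) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{a^2/2}e^{-ax} e^{-(x-a)^2/2}\,\mathrm{d}x = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-x^2/2}\,\mathrm{d}x = 1

so \mathbb{\tilde{P}} really is a probability measure.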

Thus we can estimate \mathbb{P}(X > 5) either by sampling from a normal distribution with mean 0 and counting the number of samples that are above 5 or we can sample from a normal distribution with mean 5 and calculating the appropriately weighted mean

\displaystyle   \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{\{y_i > 5\}}\,g(y_i)

where y_1, \ldots, y_n are the samples drawn from the normal distribution with mean 5.

Let’s try this out.

> biggerThan5' :: Double
> biggerThan5' = sum (evalState xs (pureMT 42)) / (fromIntegral n)
>   where
>     xs :: MonadRandom m => m [Double]
>     xs = liftM (map g) $
>          liftM (filter (>= 5.0)) $
>          liftM (map (+5)) $
>          samples replicateM n
>     g x = exp $ (5^2 / 2) - 5 * x
>     n = 100000

And now we get quite a good estimate.

ghci> biggerThan5'
  2.85776225450217e-7
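
And since Data.Number.Erf is already imported, we can compare against the exact tail probability directly (a small aside of mine, not in the original post):

-- The exact upper tail probability P(X > 5), approximately 2.87e-7,
-- which agrees with the importance-sampled estimate above.
exactBiggerThan5 :: Double
exactBiggerThan5 = 1 - normcdf 5.0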

Random Paths

The probability of another rare event we might wish to estimate is that of Brownian Motion crossing a boundary. For example, what is the probability of Browian Motion crossing the line y = 3.5? Let’s try sampling 100 paths (we restrict the number so the chart is still readable).

> epsilons :: (Foldable f, MonadRandom m) =>
>                     (Int -> RVar Double -> RVar (f Double)) ->
>                     Double ->
>                     Int ->
>                     m (f Double)
> epsilons repM deltaT n = sample $ repM n $ rvar (Normal 0.0 (sqrt deltaT))
> bM0to1 :: Foldable f =>
>           ((Double -> Double -> Double) -> Double -> f Double -> f Double)
>           -> (Int -> RVar Double -> RVar (f Double))
>           -> Int
>           -> Int
>           -> f Double
> bM0to1 scan repM seed n =
>   scan (+) 0.0 $
>   evalState (epsilons repM (recip $ fromIntegral n) n) (pureMT (fromIntegral seed))

We can see by eye in the chart below that again we do quite poorly.

We know that \mathbb{P}(T_a \leq t) = 2(1 - \Phi (a / \sqrt{t})) where T_a = \inf \{t : W_t = a\}.

> p :: Double -> Double -> Double
> p a t = 2 * (1 - normcdf (a / sqrt t))
ghci> p 1.0 1.0
  0.31731050786291415

ghci> p 2.0 1.0
  4.550026389635842e-2

ghci> p 3.0 1.0
  2.699796063260207e-3

But what if we didn’t know this formula? Define

\displaystyle   N(\omega) \triangleq  \begin{cases}  1 & \text{if } \sup_{0 \leq t \leq 1}\tilde W_t \geq a \\  0 & \text{if } \sup_{0 \leq t \leq 1}\tilde W_t < a \\  \end{cases}

where \mathbb{Q} is the measure which makes \tilde W_t Brownian Motion.

We can estimate the expectation of N

\displaystyle   \hat p_{\mathbb{Q}} = \frac{1}{M}\sum_{i=1}^M n_i

where n_i is 1 if Brownian Motion hits the barrier and 0 otherwise and M is the total number of simulations. We know from visual inspection that this gives poor results but let us try some calculations anyway.

> n = 500
> m = 10000
> supAbove :: Double -> Double
> supAbove a = fromIntegral count / fromIntegral n
>   where
>     count = length $
>             filter (>= a) $
>             map (\seed -> maximum $ bM0to1 scanl replicateM seed m) [0..n - 1]
> bM0to1WithDrift seed mu n =
>   zipWith (\m x -> x + mu * m * deltaT) [0..] $
>   bM0to1 scanl replicateM seed n
>     where
>       deltaT = recip $ fromIntegral n
ghci> supAbove 1.0
  0.326

ghci> supAbove 2.0
  7.0e-2

ghci> supAbove 3.0
  0.0

As expected for a rare event we get an estimate of 0.

Fortunately we can use importance sampling for paths. If we take \mu(\omega, t) = a where a is a constant in Girsanov’s Theorem below then we know that \tilde W_t = W_t + \int_0^t a \,\mathrm{d}s = W_t + at is \mathbb{Q}-Brownian Motion.

We observe that

\displaystyle   \begin{aligned}  \mathbb{Q}N &= \mathbb{P}\bigg(N\frac{\mathrm{d} \mathbb{Q}}{\mathrm{d} \mathbb{P}}\bigg) \\  &=  \mathbb{P}\Bigg[N  \exp \Bigg(-\int_0^1  \mu(\omega,t) \,\mathrm{d}W_t - \frac{1}{2} \int_0^1 \mu^2(\omega, t) \,\mathrm{d} t\Bigg)  \Bigg] \\  &=  \mathbb{P}\Bigg[N  \exp \Bigg(-aW_1 - \frac{1}{2} a^2\Bigg)  \Bigg]  \end{aligned}

So we can also estimate the expectation of N under \mathbb{P} as

\displaystyle   \hat p_{\mathbb{P}} = \frac{1}{M}\sum_{i=1}^M n_i\exp{\bigg(-aw^{(1)}_i - \frac{a^2}{2}\bigg)}

where n_i is now 1 if Brownian Motion with the specified drift hits the barrier and 0 otherwise, and w^{(1)}_i is Brownian Motion sampled at t=1.

We can see from the chart below that this is going to be better at hitting the required barrier.

Let’s do some calculations.

> supAbove' :: Double -> Double
> supAbove' a = (sum $ zipWith (*) ns ws) / fromIntegral n
>   where
>     deltaT = recip $ fromIntegral m
> 
>     uss = map (\seed -> bM0to1 scanl replicateM seed m) [0..n - 1]
>     ys = map last uss                                    -- W_1 for each path
>     ws = map (\x -> exp (-a * x - 0.5 * a^2)) ys         -- the Radon-Nikodym weights
> 
>     vss = map (zipWith (\m x -> x + a * m * deltaT) [0..]) uss  -- add the drift a
>     sups = map maximum vss
>     ns = map fromIntegral $ map fromEnum $ map (>=a) sups       -- 1 if the barrier is hit
ghci> supAbove' 1.0
  0.31592655955519156

ghci> supAbove' 2.0
  4.999395029856741e-2

ghci> supAbove' 3.0
  2.3859203473651654e-3

The reader is invited to try the above estimates with 1,000 samples per path to see that even with this respectable number, the calculation goes awry.

In General

If we have a probability space (\Omega, {\mathcal{F}}, \mathbb{P}) and a non-negative random variable Z with \mathbb{E}Z = 1 then we can define a new probability measure \mathbb{Q} on the same \sigma-algebra by

\displaystyle   \mathbb{Q} A \triangleq \int_A Z \,\mathrm{d} \mathbb{P}

When such a Z exists, it is called the Radon-Nikodym derivative of \mathbb{Q} with respect to \mathbb{P} and denoted \frac{\mathrm{d} \mathbb{Q}}{\mathrm{d} \mathbb{P}}.
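
For example (this is the textbook example rather than anything new), take X to be standard normal under \mathbb{P} and Z = e^{aX - a^2/2}. Then \mathbb{E}Z = 1 and for any Borel set A

\displaystyle   \mathbb{Q}(X \in A) = \int_A e^{ax - a^2/2}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,\mathrm{d}x = \int_A \frac{1}{\sqrt{2\pi}}e^{-(x - a)^2/2}\,\mathrm{d}x

so under \mathbb{Q} the same random variable X is normal with mean a; this is exactly the kind of shift of measure we used in the importance sampling example above.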

Given that we managed to shift a Normal Distribution with non-zero mean in one measure to a Normal Distribution with another mean in another measure by producing the Radon-Nikodym derivative, might it be possible to shift Brownian Motion with a drift under one probability measure to pure Brownian Motion under another probability measure by producing the appropriate Radon-Nikodym derivative? The answer is yes, as Girsanov’s Theorem below shows.

Girsanov’s Theorem

Let W_t be Brownian Motion on a probability space (\Omega, {\mathcal{F}}, \mathbb{P}) and let \{{\mathcal{F}}_t\}_{t \in [0,T]} be a filtration for this Brownian Motion and let \mu(\omega, t) be an adapted process such that the Novikov Sufficiency Condition holds

\displaystyle   \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^T \mu^2(s, \omega) \,\mathrm{d}s\bigg)}\bigg] = K < \infty

then there exists a probability measure \mathbb{Q} such that

  • \mathbb{Q} is equivalent to \mathbb{P}, that is, \mathbb{Q}(A) = 0 \iff \mathbb{P}(A) = 0.

  • \displaystyle {\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\mathbb{P}} = \exp \Bigg(-\int_0^T \mu(\omega,t) \,\mathrm{d}W_t - \frac{1}{2} \int_0^T \mu^2(\omega, t) \,\mathrm{d} t\Bigg)}.

  • \tilde W_t = W_t + \int_0^t \mu(\omega, s) \,\mathrm{d}s is Brownian Motion on the probability space (\Omega, {\mathcal{F}}, \mathbb{Q}) also with the filtration \{\mathcal{F}_t\}_{t \in [0,T]}.

In order to prove Girsanov’s Theorem, we need a condition which allows us to infer that the exponential process M_t(\mu) (defined in the proof below) is a strict martingale. One such useful condition, to which we have already alluded, is the Novikov Sufficiency Condition.

Proof

Define \mathbb{Q} by

\displaystyle   \mathbb{Q}(A) = \mathbb{P}(1_A M_T) \quad \mathrm{where} \quad  M_t(\mu) = \exp{\bigg(-\int_0^t \mu(\omega, s) \,\mathrm{d}W_s  -                        \frac{1}{2}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s\bigg)}

Then, temporarily overloading the notation and writing \mathbb{P} for expectation under \mathbb{P}, and applying the Novikov Sufficiency Condition to f(s) - \mu(\omega, s) (so that the exponential local martingale it generates has \mathbb{P}-expectation 1), we have

\displaystyle   \begin{aligned}  \mathbb{Q}\bigg[\exp{\int_0^T f(s) \,\mathrm{d}X_s}\bigg] &=  \mathbb{Q}\bigg[\exp{\bigg(\int_0^T f(s) \,\mathrm{d}W_s + \int_0^T f(s)\mu(\omega, s) \,\mathrm{d}s\bigg)}\bigg] \\  &=  \mathbb{P}\bigg[\exp{\bigg(  \int_0^T \big(f(s) - \mu(\omega, s)\big)\,\mathrm{d}W_s +  \int_0^T f(s)\mu(\omega, s)\,\mathrm{d}s -  \frac{1}{2}\int_0^T \mu^2(\omega ,s) \,\mathrm{d}s  \bigg)}\bigg] \\  &=  \mathbb{P}\bigg[\exp{\bigg(  \int_0^T \big(f(s) - \mu(\omega, s)\big)\,\mathrm{d}W_s -  \frac{1}{2}\int_0^T \big(f(s) - \mu(\omega ,s)\big)^2 \,\mathrm{d}s +  \frac{1}{2}\int_0^T f^2(s) \,\mathrm{d}s  \bigg)}\bigg] \\  &=  \exp{\bigg(\frac{1}{2}\int_0^T f^2(s) \,\mathrm{d}s\bigg)}  \,  \mathbb{P}\bigg[\exp{\bigg(  \int_0^T \big(f(s) - \mu(\omega, s)\big)\,\mathrm{d}W_s -  \frac{1}{2}\int_0^T \big(f(s) - \mu(\omega ,s)\big)^2 \,\mathrm{d}s  \bigg)}\bigg] \\  &=  \exp{\bigg(\frac{1}{2}\int_0^T f^2(s) \,\mathrm{d}s\bigg)}  \end{aligned}

Taking f = \zeta{\mathbb{I}}_{(s,t]} and continuing analytically from \zeta to i\zeta we see that

\displaystyle   \mathbb{Q}\big(e^{i \zeta (X_t - X_s)}\big) = e^{-\frac{1}{2} \zeta^2 (t - s)}

And since this characterizes Brownian Motion, we are done.

\blacksquare

The Novikov Sufficiency Condition

The Novikov Sufficiency Condition Statement

Let \mu \in {\cal{L}}^2_{\mathrm{LOC}}[0,T] and further let it satisfy the Novikov condition

\displaystyle   \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^T \mu^2(s, \omega) \,\mathrm{d}s\bigg)}\bigg] = K < \infty

then the process defined by

\displaystyle   M_t(\mu) = \exp{\bigg(\int_0^t \mu(\omega, s) \,\mathrm{d}W_s  -                        \frac{1}{2}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s\bigg)}

is a strict martingale.

Before we prove this, we need two lemmas.

Lemma 1

Let M_t for t \in [0,T] be a non-negative local martingale. Then M_t is a super-martingale and if further \mathbb{E}M_T = \mathbb{E}M_0 then M_t is a strict martingale.

Proof

Let \{\tau_n\}_{n \in \mathbb{N}} be a localizing sequence for M_t then for 0 < s < t < T and using Fatou’s lemma and the fact that the stopped process is a strict martingale

\displaystyle   \mathbb{E}(M_t \,|\, {\mathcal{F}_s}) =  \mathbb{E}(\liminf_{n \rightarrow \infty} M_{t \land \tau_n} \,|\, {\mathcal{F}_s}) \leq  \liminf_{n \rightarrow \infty} \mathbb{E}(M_{t \land \tau_n} \,|\, {\mathcal{F}_s}) =  \liminf_{n \rightarrow \infty} M_{s \land \tau_n} = M_s

Thus M_t is a super-martingale and therefore

\displaystyle   \mathbb{E}M_T \leq \mathbb{E}M_t \leq \mathbb{E}M_s \leq \mathbb{E}M_0

By assumption we have \mathbb{E}M_T = \mathbb{E}M_0, so all the expectations in the chain above are equal. Since for s < t the non-negative random variable M_s - \mathbb{E}(M_t \,|\, {\mathcal{F}_s}) then has zero expectation, it is almost surely zero, and thus M_t is a strict martingale.

\blacksquare

Lemma 2

Let M_t be a non-negative local martingale. If \{\tau_n\}_{n \in \mathbb{N}} is a localizing sequence such that \sup_n \|M_{T \land \tau_n}\|_p < \infty for some p > 1 then M_t is a strict martingale.

Proof

\displaystyle   \mathbb{E}(|M_T - M_{T \land \tau_n}|) \leq  \mathbb{E}(|M_T - r \land M_T|) +  \mathbb{E}(|r \land M_T - r \land M_{T \land \tau_n}|) +  \mathbb{E}(M_{T \land \tau_n} - r \land M_{T \land \tau_n})

By the super-martingale property \mathbb{E}(M_T) \leq \mathbb{E}(M_0) < \infty and thus by dominated convergence we have that

\displaystyle   \lim_{r \rightarrow \infty} \mathbb{E}(r \land M_T) = \mathbb{E}(M_T) \quad \mathrm{and} \quad  \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty}\mathbb{E}(|r \land M_T - r \land M_{T \land \tau_n}|) = 0

We also have that

\displaystyle   \begin{aligned}  \mathbb{E}(M_{T \land \tau_n} - r \land M_{T \land \tau_n}) &=  \mathbb{E}((M_{T \land \tau_n} - r \land M_{T \land \tau_n}){I}_{(M_{T \land \tau_n} > r)}) +  \mathbb{E}((M_{T \land \tau_n} - r \land M_{T \land \tau_n}){I}_{(M_{T \land \tau_n} \leq r)}) \\  &= \mathbb{E}((M_{T \land \tau_n} - r \land M_{T \land \tau_n}){I}_{(M_{T \land \tau_n} > r)}) \\  &= \mathbb{E}(M_{T \land \tau_n}{I}_{(M_{T \land \tau_n} > r)}) - r\mathbb{P}({M_{T \land \tau_n} > r})  \end{aligned}

By Chebyshev’s inequality (see note (2) below), we have

\displaystyle   r\mathbb{P}({M_{T \land \tau_n} > r}) \leq \frac{r\,\mathbb{E}(M_{T \land \tau_n})^p}{r^p} \leq  \frac{\sup_n{\mathbb{E}(M_{T \land \tau_n})^p}}{r^{p-1}}

Taking limits first over n \rightarrow \infty and then over r \rightarrow \infty we see that

\displaystyle   \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty} r\mathbb{P}({M_{T \land \tau_n} > r}) \rightarrow 0

For 0 \leq r \leq x and p > 1 we have x \leq r^{1-p}x^p. Thus

\displaystyle   \mathbb{E}(M_{T \land \tau_n}{I}_{(M_{T \land \tau_n} > r)}) \leq  r^{1-p}\mathbb{E}(M_{T \land \tau_n}^p{I}_{(M_{T \land \tau_n} > r)}) \leq  r^{1-p}\sup_n\mathbb{E}(M_{T \land \tau_n}^p)

Again taking limits over n \rightarrow \infty and then over r \rightarrow \infty we have

\displaystyle   \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty} \mathbb{E}(M_{T \land \tau_n}{I}_{(M_{T \land \tau_n} > r)}) \rightarrow 0

These two conclusions imply

\displaystyle   \lim_{r \rightarrow \infty}\lim_{n \rightarrow \infty} \mathbb{E}(M_{T \land \tau_n} - r \land M_{T \land \tau_n}) \rightarrow 0

We can therefore conclude (since M_{T \land \tau_n} is a martingale)

\displaystyle   \mathbb{E}(M_T) = \lim_{n \rightarrow \infty}\mathbb{E}(M_{T \land \tau_n}) =  \mathbb{E}(M_0)

Thus by the preceding lemma M_t is a strict as well as a local martingale.

\blacksquare

The Novikov Sufficiency Condition Proof

Step 1

First we note that M_t(\lambda\mu) is a local martingale for 0 < \lambda < 1. Let us show that it is a strict martingale. We can do this if for any localizing sequence \{\tau_n\}_{n \in \mathbb{N}} we can show

\displaystyle   \sup_n\mathbb{E}(M_{T \land \tau_n}(\lambda\mu))^p < \infty

using the preceding lemma with some p > 1.

We note that

\displaystyle   \begin{aligned}  M_t(\lambda\mu) &=  \exp{\bigg(\int^t_0 \lambda\mu(\omega, s)\,\mathrm{d}W_s -  \frac{1}{2}\int^t_0 \lambda^2\mu^2(\omega, s)\,\mathrm{d}s\bigg)} \\  &= {(M_t(\mu))}^{\lambda^2}\exp{\bigg((\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}  \end{aligned}

Now apply Hölder’s inequality with conjugates ({p\lambda^2})^{-1} and ({1 - p\lambda^2})^{-1} where p is chosen to ensure that the conjugates are both strictly greater than 1 (otherwise we cannot apply the inequality).

\displaystyle   \begin{aligned}  \mathbb{E}((M_t(\lambda\mu))^p)  &=  \mathbb{E}\bigg[{(M_t(\mu))}^{p\lambda^2}\exp{\bigg(p(\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg] \\  &\le  \bigg\|{(M_t(\mu))}^{p\lambda^2}\bigg\|_{(p\lambda^2)^{-1}}  \bigg\|\exp{\bigg(p(\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg\|_{(1 - p\lambda^2)^{-1}} \\  &=  \mathbb{E}\bigg[M_t(\mu)\bigg]^{p\lambda^2}  \mathbb{E}\bigg[\exp{\bigg(\frac{p(\lambda - \lambda^2)}{1 - p\lambda^2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2}  \end{aligned}

Now let us choose

\displaystyle   p\frac{\lambda - \lambda^2}{1 - p\lambda^2} = \frac{1}{2}

then

\displaystyle   \begin{aligned}  2p(\lambda - \lambda^2) &= 1 - p\lambda^2 \\  p & = \frac{1}{2(\lambda - \lambda^2) + \lambda^2} \\  p &= \frac{1}{(2 - \lambda)\lambda}  \end{aligned}

In order to apply Hölder’s inequality we need to check that (p\lambda^2)^{-1} > 1 and that (1 - p\lambda^2)^{-1} > 1, which amounts to checking that 0 < p\lambda^2 < 1; with the above choice p\lambda^2 = \lambda / (2 - \lambda), so this holds for 0 < \lambda < 1. We also need to check that p > 1, which amounts to checking that (2 - \lambda)\lambda < 1 for 0 < \lambda < 1, and since (2 - \lambda)\lambda = 1 - (1 - \lambda)^2 this is easily seen to be true.

Re-writing the above inequality with this value of p we have

\displaystyle   \begin{aligned}  \mathbb{E}((M_t(\lambda\mu))^p)  &\le  \mathbb{E}\bigg[M_t(\mu)\bigg]^{p\lambda^2}  \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2}  \end{aligned}

By the first lemma, since M_t(\mu) is a non-negative local martingale, it is also a supermartingale. Furthermore \mathbb{E}(M_0(\mu)) = 1. Thus

\displaystyle   \mathbb{E}\bigg[M_t(\mu)\bigg]^{p\lambda^2} \leq 1

and therefore

\displaystyle   \begin{aligned}  \mathbb{E}((M_t(\lambda\mu))^p)  &\le  \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2}  \end{aligned}

Step 2

Recall we have

\displaystyle   {M_t} =  \exp\bigg(  \int_0^t \mu(\omega ,s)\,\mathrm{d}W_s - \frac{1}{2}\int_0^t \mu^2(\omega ,s)\,\mathrm{d}s  \bigg)

Taking logs gives

\displaystyle   \log{M_t} =  \int_0^t \mu(\omega ,s)\,\mathrm{d}W_s - \frac{1}{2}\int_0^t \mu(\omega ,s)^2\,\mathrm{d}s

or in differential form

\displaystyle   \mathrm{d}(\log{M_t}) =  \mu(\omega ,t)\,\mathrm{d}W_t - \frac{1}{2}\mu(\omega ,t)^2\,\mathrm{d}t

We can also apply Itô’s rule to \log{M_t}

\displaystyle   \begin{aligned}  \mathrm{d}(\log{M_t})  &= \frac{1}{M_t}\,\mathrm{d}M_t   - \frac{1}{2}\frac{1}{M_t^2}\,\mathrm{d}[M]_t \\  \end{aligned}

where [\ldots] denotes the quadratic variation of a stochastic process.

Comparing terms gives the stochastic differential equation

\displaystyle   \mathrm{d}M_t = M_t\mu(\omega,t)\,\mathrm{d}W_t

In integral form this can also be written as

\displaystyle   M_t = 1 + \int_0^t M_s\mu(\omega, s)\,\mathrm{d}W_s

Thus M_t is a local martingale (it is a stochastic integral against Brownian Motion) and by the first lemma it is a supermartingale. Hence \mathbb{E}M_t \leq \mathbb{E}M_0 = 1.

Next we note that

\displaystyle   \exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, s)\,\mathrm{d}W_s\bigg)} =  \exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, s)\,\mathrm{d}W_s -       \frac{1}{4}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s\bigg)}  \exp{\bigg(\frac{1}{4}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s\bigg)}

to which we can apply Hölder’s inequality with conjugates p = q = 2 to obtain

\displaystyle   \begin{aligned}  \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg] &=  \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, s)\,\mathrm{d}W_s -                             \frac{1}{4}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s                       \bigg)}                  \exp{\bigg(\frac{1}{4}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s                       \bigg)}\bigg] \\  & \leq  \sqrt{\mathbb{E}\bigg[\exp{\bigg(\int_0^t \mu(\omega, s)\,\mathrm{d}W_s -                             \frac{1}{2}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s                       \bigg)}\bigg]}  \sqrt{\mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s                       \bigg)}\bigg]}  \end{aligned}

Applying the supermartingale inequality then gives

\displaystyle   \begin{aligned}  \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]  & \leq  \sqrt{\mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s                       \bigg)}\bigg]}  \end{aligned}

Step 3

Now we can apply the result in Step 2 to the result in Step 1.

\displaystyle   \begin{aligned}  \mathbb{E}((M_t(\lambda\mu))^p)  &\le  \mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg]^{1 - p\lambda^2} \\  &\le  {\mathbb{E}\bigg[\exp{\bigg(\frac{1}{2}\int_0^t \mu^2(\omega, s) \,\mathrm{d}s                        \bigg)}\bigg]}^{(1 - p\lambda^2)/2} \\  &\le  K^{(1 - p\lambda^2)/2}  \end{aligned}

We can replace M_t by M_t {\mathcal{I}}_{t < \tau} for any stopping time \tau. Thus for a localizing sequence we have

\displaystyle   \begin{aligned}  \mathbb{E}((M_{t \land \tau_n}(\lambda\mu))^p)  &\le  K^{(1 - p\lambda^2)/2}  \end{aligned}

From which we can conclude

\displaystyle   \sup_n \|M_{T \land \tau_n}(\lambda\mu)\|_p < \infty

Now we can apply the second lemma to conclude that M_t(\lambda\mu) is a strict martingale.

Final Step

We have already calculated that

\displaystyle   \begin{aligned}  M_t(\lambda\mu) &=  \exp{\bigg(\int^t_0 \lambda\mu(\omega, s)\,\mathrm{d}W_s -  \frac{1}{2}\int^t_0 \lambda^2\mu^2(\omega, s)\,\mathrm{d}s\bigg)} \\  &= {(M_t(\mu))}^{\lambda^2}\exp{\bigg((\lambda - \lambda^2)\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}  \end{aligned}

Now apply Hölder’s inequality with conjugates p = \lambda^{-2} and q = (1 - \lambda^2)^{-1}.

\displaystyle   1 = \mathbb{E}(M_t(\lambda\mu)) \le  \mathbb{E}(M_t(\mu))^{\lambda^2}\,\mathbb{E}\bigg(\exp{\bigg(\frac{\lambda}{1 + \lambda}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg)^{1 - \lambda^2}

And then we can apply Jensen’s inequality to the last term on the right hand side with the convex function x \mapsto x^{(1 + \lambda)/(2\lambda)}.

\displaystyle   1 \le  \mathbb{E}(M_t(\mu))^{\lambda^2}  \mathbb{E}{\bigg(}\exp{\bigg(\frac{1}{2}\int^t_0 \mu(\omega, s)\,\mathrm{d}W_s\bigg)}\bigg)^{2\lambda(1- \lambda)}

Using the inequality we established in Step 2 and the Novikov condition then gives

\displaystyle   1 \le  \mathbb{E}(M_t(\mu))^{\lambda^2}  K^{\lambda(1 - \lambda)}

If we now let \lambda \nearrow 1 we see that we must have 1 \le \mathbb{E}(M_t(\mu)). We already know that 1 \ge \mathbb{E}(M_t(\mu)) by the first lemma and so we have finally proved that M_t(\mu) is a martingale.

Notes

  1. We have already used importance sampling and also touched on changes of measure.

  2. Chebyshev’s inequality is usually stated for the second moment but the proof is easily adapted:

\displaystyle   \mathbb P( |X| > u ) = \int 1_{|X| > u} ~d\mathbb P = \frac 1 {u^p} \int u^p 1_{|X| > u} ~d\mathbb P < \frac 1 {u^p} \int |X|^p 1_{|X| > u} ~ d\mathbb P \le \frac 1 {u^p} \mathbb E|X|^p.

  3. We follow Handel (2007); a similar approach is given in Steele (2001).

Bibliography

Handel, Ramon von. 2007. “Stochastic Calculus, Filtering, and Stochastic Control (Lecture Notes).”

Steele, J.M. 2001. Stochastic Calculus and Financial Applications. Applications of Mathematics. Springer New York. https://books.google.co.uk/books?id=fsgkBAAAQBAJ.

Some Background on Hidden Markov Models

Introduction

If you look at the wikipedia article on Hidden Markov Models (HMMs) then you might be forgiven for concluding that these deal only with discrete time and finite state spaces. In fact, HMMs are much more general. Furthermore, a better understanding of such models can be helped by putting them into context. Before actually specifying what an HMM is, let us review something of Markov processes. A subsequent blog post will cover HMMs themselves.

Markov Process and Chains

Recall that a transition kernel is a mapping \mu : X \times {\cal{Y}} \rightarrow \overline{\mathbb{R}}_{+} where (X, {\cal{X}}) and (Y, {\cal{Y}}) are two measurable spaces such that \mu(s, \cdot) is a probability measure on {\cal{Y}} for all s \in X and such that \mu(\cdot, A) is a measurable function on X for all A \in {\cal{Y}}.

For example, we could have X = Y = \{a, b\} and {\cal{X}} = {\cal{Y}} = \{\emptyset, \{a\}, \{b\}, \{a,b\}\} and \mu(a,\{a\}) = 0.4, \mu(a,\{b\}) = 0.6, \mu(b,\{a\}) = 0.6, \mu(b,\{b\}) = 0.4. Hopefully this should remind you of the transition matrix of a Markov chain.

Recall further that a family of such transitions \{\mu_t\}_{t \in T} where T is some index set satisfying

\displaystyle  \mu_{t+s}(x, A) = \int_{Y} \mu_s(x, {\mathrm{d}y})\mu_t(y, A)

gives rise to a Markov process (under some mild conditions — see Rogers and Williams (2000) and Kallenberg (2002) for much more detail), that is, a process in which what happens next only depends on where the process is now and not how it got there.

Let us carry on with our example and take T = \mathbb{N}. With a slight abuse of notation and since Y is finite we can re-write the integral as a sum

\displaystyle  \mu_{n+m}(x, z) = \sum_{y \in Y} \mu_m(x, y)\mu_n(y, z)

which we recognise as a restatement of how Markov transition matrices combine.
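
As a small standalone sketch (not part of any of the literate modules in these posts and with ad hoc names), we can represent the two-state kernel above by its values on singleton sets, that is as a transition matrix, and check that composing kernels via this sum really is just matrix multiplication:

mu1 :: Char -> Char -> Double
mu1 'a' 'a' = 0.4
mu1 'a' 'b' = 0.6
mu1 'b' 'a' = 0.6
mu1 'b' 'b' = 0.4
mu1 _   _   = 0.0

-- The discrete Chapman-Kolmogorov equation:
-- mu_{n+m}(x, z) = sum_y mu_m(x, y) * mu_n(y, z)
compose :: (Char -> Char -> Double) -> (Char -> Char -> Double)
        -> Char -> Char -> Double
compose muM muN x z = sum [ muM x y * muN y z | y <- "ab" ]

twoStep :: Char -> Char -> Double
twoStep = compose mu1 mu1   -- e.g. twoStep 'a' 'a' == 0.4 * 0.4 + 0.6 * 0.6 == 0.52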

Some Examples

A Fully Deterministic System

A deterministic system can be formulated as a Markov process with a particularly simple transition kernel given by

\displaystyle  \mu_t(x_s, A) = \delta(f_t(x_s), A) \triangleq  \begin{cases}  1           & \text{if } f_t(x_s) \in    A \\  0           & \text{if } f_t(x_s) \notin A  \end{cases}

where f_t is the deterministic state update function (the flow) and \delta is the Dirac delta function.
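
In code this is little more than an if expression; here is a standalone sketch (the names are ad hoc and the set A is represented as a predicate):

-- The Dirac kernel of a deterministic flow: all of the probability mass
-- sits on the point f_t(x_s).
deterministicKernel :: (Double -> a -> a)  -- the flow f
                    -> Double              -- the time t
                    -> a                   -- the current state x_s
                    -> (a -> Bool)         -- the set A, as a predicate
                    -> Double
deterministicKernel f t x inA = if inA (f t x) then 1.0 else 0.0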

Parameters

Let us suppose that the deterministic system depends on some time-varying values for which we are unable or unwilling to specify a deterministic model. For example, we may be considering a predator-prey model where the parameters cannot explain every aspect. We could augment the deterministic kernel in the previous example with

\displaystyle  \mu_t(\theta_s, {\mathrm{d}\phi}) =  {{\cal{N}} \left( {\mathrm{d}\phi} \,\vert\, \theta_s, \tau^2(t-s) \right)}

where we use Greek letters for the parameters (and Roman letters for state) and we use e.g. {\mathrm{d}\phi} to indicate probability densities. In other words, the parameters tend to wiggle around like Brown’s pollen particles rather than remaining absolutely fixed.

Ornstein-Uhlenbeck

Of course Brownian motion or diffusion may not be a good model for our parameters; with Brownian motion, the parameters could drift off to \pm\infty. We might believe that our parameters tend to stay close to some given value (mean-reverting) and use the Ornstein-Uhlenbeck kernel.

\displaystyle   \mu_t(\theta_s, {\mathrm{d}\phi}) =  {{\cal{N}} \left( {\mathrm{d}\phi} \,\vert\, \alpha + (\theta_s - \alpha)e^{-\beta t},\frac{\sigma^2}{2\beta}\big(1 - e^{-2\beta t}\big) \right)}

where \beta expresses how strongly we expect the parameter to respond to perturbations, \alpha is the mean to which the process wants to revert (aka the asymptotic mean) and \sigma^2 expresses how noisy the process is.

It is sometimes easier to view these transition kernels in terms of stochastic differential equations. Brownian motion can be expressed as

\displaystyle   \mathrm{d}X_t = \sigma\mathrm{d}W_t

and Ornstein-Uhlenbeck can be expressed as

\displaystyle   \mathrm{d}X_t = -\beta(X_t - \alpha)\mathrm{d}t + \sigma\mathrm{d}W_t

where W_t is the Wiener process.

Let us check that the latter stochastic differential equation gives the stated kernel. Re-writing it in integral form and without loss of generality taking s= 0

\displaystyle   X_t = \alpha + (x_0 - \alpha)e^{-\beta t} + \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s

Since the integral is of a deterministic function, the distribution of X_t is normal. Thus we need only calculate the mean and variance.

The mean is straightforward.

\displaystyle   \mathbb{E}[X_t \,\vert\, X_0 = x_0] =  \mathbb{E}\Bigg[\alpha + (x_0 - \alpha)e^{-\beta t} + \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s \,\vert\, X_0 = x_0\Bigg] =  \alpha + (x_0 - \alpha)e^{-\beta t}

Without loss of generality assume t \leq u and writing \mathbb{C} for covariance

\displaystyle   \begin{aligned}  \mathbb{C}[X_u, X_t \,\vert\, X_0 = x_0] &=  \mathbb{E}\Bigg[\Bigg( \sigma\int_0^u e^{-\beta(u - s)}\mathrm{d}W_s  \Bigg)                  \Bigg( \sigma\int_0^t e^{-\beta(t - s)}\mathrm{d}W_s  \Bigg)\Bigg] \\  &=  \sigma^2e^{-\beta(u + t)}  \mathbb{E}\Bigg[\Bigg(\int_0^u e^{\beta s}\mathrm{d}W_s\Bigg)            \Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s\Bigg)\Bigg] \\  &=  \sigma^2e^{-\beta(u + t)}  \mathbb{E}\Bigg[\Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s + \int_t^u e^{\beta s}\mathrm{d}W_s\Bigg)            \Bigg(\int_0^t e^{\beta s}\mathrm{d}W_s\Bigg)\Bigg]  \end{aligned}

And now we can use Itô’s isometry and the independence of the increments

\displaystyle   \begin{aligned}  \mathbb{C}[X_u, X_t \,\vert\, X_0 = x_0] &=  \sigma^2e^{-\beta(u + t)}\mathbb{E}\Bigg[ \int_0^t e^{2\beta s}\mathrm{d}s  \Bigg] \\  &=  \frac{\sigma^2e^{-\beta(u + t)}}{2\beta}\big(e^{2\beta t} - 1\big)  \end{aligned}

Substituting in t = u gives the desired result.
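
We can also check this numerically. The following standalone sketch (not part of any of the literate modules; the parameter values are arbitrary) simulates the Ornstein-Uhlenbeck stochastic differential equation with the Euler-Maruyama method, using the same random-fu machinery as elsewhere in these posts, and compares the sample mean and variance at the final time with the mean and variance of the kernel derived above:

import Control.Monad.State
import Data.Random
import Data.Random.Source.PureMT

alpha, beta, sigma, x0, tEnd :: Double
alpha = 1.0    -- asymptotic mean
beta  = 2.0    -- mean reversion rate
sigma = 0.5    -- noisiness
x0    = 3.0    -- starting value
tEnd  = 0.7    -- time at which we compare

nSteps, nPaths :: Int
nSteps = 1000
nPaths = 10000

deltaT :: Double
deltaT = tEnd / fromIntegral nSteps

-- Euler-Maruyama discretisation of dX_t = -beta (X_t - alpha) dt + sigma dW_t,
-- returning the value at time tEnd.
ouEnd :: MonadRandom m => m Double
ouEnd = go x0 nSteps
  where
    go x 0 = return x
    go x k = do
      dW <- sample (Normal 0.0 (sqrt deltaT))
      go (x - beta * (x - alpha) * deltaT + sigma * dW) (k - 1)

main :: IO ()
main = do
  let xs = evalState (replicateM nPaths ouEnd) (pureMT 42)
      m  = sum xs / fromIntegral nPaths
      v  = sum (map (\x -> (x - m)^2) xs) / fromIntegral nPaths
  putStrLn $ "sample mean:     " ++ show m
  putStrLn $ "kernel mean:     " ++ show (alpha + (x0 - alpha) * exp (-beta * tEnd))
  putStrLn $ "sample variance: " ++ show v
  putStrLn $ "kernel variance: " ++ show (sigma^2 * (1 - exp (-2 * beta * tEnd)) / (2 * beta))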

Bibliography

Kallenberg, O. 2002. Foundations of Modern Probability. Probability and Its Applications. Springer New York. http://books.google.co.uk/books?id=TBgFslMy8V4C.

Rogers, L. C. G., and David Williams. 2000. Diffusions, Markov Processes, and Martingales. Vol. 1. Cambridge Mathematical Library. Cambridge: Cambridge University Press.

A Type Safe Reverse or Some Hasochism

Conor McBride was not joking when he and his co-author entitled their paper about dependent typing in Haskell “Hasochism”: Lindley and McBride (2013).

In trying to resurrect the Haskell package yarr, it seemed that a dependently typed reverse function needed to be written. Writing such a function turns out to be far from straightforward. How GHC determines that a proof (program) discharges a proposition (type signature) is rather opaque and perhaps not surprisingly the error messages one gets if the proof is incorrect are far from easy to interpret.

I’d like to thank all the folk on StackOverflow whose answers and comments I have used freely below. Needless to say, any errors are entirely mine.

Here are two implementations, each starting from different axioms (NB: I have never seen type families referred to as axioms but it seems helpful to think about them in this way).

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE GADTs                        #-}
> {-# LANGUAGE KindSignatures               #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE UndecidableInstances         #-}
> {-# LANGUAGE ExplicitForAll               #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE ScopedTypeVariables          #-}

For both implementations, we need propositional equality: if a :~: b is inhabited by some terminating value, then the type a is the same as the type b. Further we need the normal form of an equality proof, Refl :: a :~: a, and a function, gcastWith, which allows us to use internal equality (:~:) to discharge a required proof of external equality (~). Readers familiar with topos theory, for example see Lambek and Scott (1988), will note that the notation is reversed.

> import Data.Type.Equality ( (:~:) (Refl), gcastWith )

For the second of the two approaches adumbrated we will need

> import Data.Proxy

The usual natural numbers:

> data Nat = Z | S Nat

We need some axioms:

  1. A left unit and
  2. Restricted commutativity.
> type family (n :: Nat) :+ (m :: Nat) :: Nat where
>     Z   :+ m = m
>     S n :+ m = n :+ S m

We need the usual singleton for Nat to tie types and terms together.

> data SNat :: Nat -> * where
>   SZero :: SNat Z
>   SSucc :: SNat n -> SNat (S n)

Now we can prove some lemmas.

First a lemma showing we can push :+ inside a successor, S.

> succ_plus_id :: SNat n1 -> SNat n2 -> (((S n1) :+ n2) :~: (S (n1 :+ n2)))
> succ_plus_id SZero _ = Refl
> succ_plus_id (SSucc n) m = gcastWith (succ_plus_id n (SSucc m)) Refl

This looks nothing like a standard mathematical proof and it’s hard to know what ghc is doing under the covers but presumably something like this:

  • For SZero
  1. S Z :+ n2 = Z :+ S n2 (by axiom 2) = S n2 (by axiom 1) and
  2. S (Z :+ n2) = S n2 (by axiom 1)
  3. So S Z :+ n2 = S (Z :+ n2)
  • For SSucc
  1. SSucc n :: SNat (S k) so n :: SNat k and m :: SNat i so SSucc m :: SNat (S i)
  2. succ_plus_id n (SSucc m) :: k ~ S p => S p :+ S i :~: S (p :+ S i) (by hypothesis)
  3. k ~ S p => S p :+ S i :~: S (S p :+ i) (by axiom 2)
  4. k :+ S i :~: S (k :+ i) (by substitution)
  5. S k :+ i :~: S (k :+ i) (by axiom 2)

Second a lemma showing that Z is also the right unit.

> plus_id_r :: SNat n -> ((n :+ Z) :~: n)
> plus_id_r SZero = Refl
> plus_id_r (SSucc n) = gcastWith (plus_id_r n) (succ_plus_id n SZero)
  • For SZero
  1. Z :+ Z = Z (by axiom 1)
  • For SSucc
  1. SSucc n :: SNat (S k) so n :: SNat k
  2. plus_id_r n :: k :+ Z :~: k (by hypothesis)
  3. succ_plus_id n SZero :: S k :+ Z :~: S (k :+ Z) (by the succ_plus_id lemma)
  4. succ_plus_id n SZero :: k :+ Z ~ k => S k :+ Z :~: S k (by substitution)
  5. gcastWith (plus_id_r n) (succ_plus_id n SZero) :: S k :+ Z :~: S k (discharging the hypothesis in 4 with the proof in 2)

Now we can define vectors which have their lengths encoded in their type.

> infixr 4 :::
> data Vec a n where
>     Nil   :: Vec a Z
>     (:::) :: a -> Vec a n -> Vec a (S n)

We can prove a simple result using the lemma that Z is a right unit.

> elim0 :: SNat n -> (Vec a (n :+ Z) -> Vec a n)
> elim0 n x = gcastWith (plus_id_r n) x

Armed with this we can write an {\mathcal{O}}(n) reverse function in which the length of the result is guaranteed to be the same as the length of the argument.

> size :: Vec a n -> SNat n
> size Nil         = SZero
> size (_ ::: xs)  = SSucc $ size xs
> accrev :: Vec a n -> Vec a n
> accrev x = elim0 (size x) $ go Nil x where
>     go :: Vec a m -> Vec a n -> Vec a (n :+ m)
>     go acc  Nil       = acc
>     go acc (x ::: xs) = go (x ::: acc) xs
> toList :: Vec a n -> [a]
> toList  Nil       = []
> toList (x ::: xs) = x : toList xs
> test0 :: [String]
> test0 = toList $ accrev $ "a" ::: "b" ::: "c" ::: Nil
ghci> test0
  ["c","b","a"]

For an alternative approach, let us change the axioms slightly.

> type family (n1 :: Nat) + (n2 :: Nat) :: Nat where
>   Z + n2 = n2
>   (S n1) + n2 = S (n1 + n2)

Now the proof that Z is a right unit is more straightforward.

> plus_id_r1 :: SNat n -> ((n + Z) :~: n)
> plus_id_r1 SZero = Refl
> plus_id_r1 (SSucc n) = gcastWith (plus_id_r1 n) Refl

For the lemma showing we can push + inside a successor, S, we can use a Proxy.

> plus_succ_r1 :: SNat n1 -> Proxy n2 -> ((n1 + (S n2)) :~: (S (n1 + n2)))
> plus_succ_r1 SZero _ = Refl
> plus_succ_r1 (SSucc n1) proxy_n2 = gcastWith (plus_succ_r1 n1 proxy_n2) Refl

Now we can write our reverse function without having to calculate sizes.

> accrev1 :: Vec a n -> Vec a n
> accrev1 Nil = Nil
> accrev1 list = go SZero Nil list
>   where
>     go :: SNat n1 -> Vec a n1 -> Vec a n2 -> Vec a (n1 + n2)
>     go snat acc Nil = gcastWith (plus_id_r1 snat) acc
>     go snat acc (h ::: (t :: Vec a n3)) =
>       gcastWith (plus_succ_r1 snat (Proxy :: Proxy n3))
>                 (go (SSucc snat) (h ::: acc) t)
> test1 :: [String]
> test1 = toList $ accrev1 $ "a" ::: "b" ::: "c" ::: Nil
ghci> test1
  ["c","b","a"]

Bibliography

Lambek, J., and P.J. Scott. 1988. Introduction to Higher-Order Categorical Logic. Cambridge Studies in Advanced Mathematics. Cambridge University Press. http://books.google.co.uk/books?id=6PY_emBeGjUC.

Lindley, Sam, and Conor McBride. 2013. “Hasochism: The Pleasure and Pain of Dependently Typed Haskell Programming.” In Proceedings of the 2013 ACM SIGPLAN Symposium on Haskell, 81–92. Haskell ’13. New York, NY, USA: ACM. doi:10.1145/2503778.2503786.

Bayesian Analysis: A Conjugate Prior and Markov Chain Monte Carlo

Introduction

This is meant to be a shorter blog post than normal with the expectation that the material will be developed further in future blog posts.

A Bayesian will have a prior view of the distribution of some data and then, based on the data, update that view. Mostly the updated distribution, the posterior, will not be expressible as an analytic function and sampling via Markov Chain Monte Carlo (MCMC) is the only way to determine it.

In some special cases, when the posterior is of the same family of distributions as the prior, then the posterior is available analytically and we call the posterior and prior conjugate. It turns out that the normal or Gaussian distribution is conjugate with respect to a normal likelihood distribution.

This gives us the opportunity to compare MCMC against the analytic solution and give ourselves more confidence that MCMC really does deliver the goods.

Some points of note:

  • Since we want to display the posterior (and the prior for that matter), for histograms we use the histogram-fill package.

  • Since we are using Monte Carlo we can use all the cores on our computer via one of Haskell’s parallelization mechanisms.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module ConjMCMCSimple where
> 
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import qualified Data.Histogram as H
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import Control.Parallel.Strategies
> 
> import Diagrams.Backend.Cairo.CmdLine
> 
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )
> 
> import LinRegAux

A Simple Example

Analytically

Suppose the prior is \mu \sim \cal{N}(\mu_0, \sigma_0^2), that is

\displaystyle   \pi(\mu) \propto \exp{\bigg( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\bigg)}

Our data is IID normal, x_i \sim \cal{N}(\mu, \sigma^2), where \sigma^2 is known, so the likelihood is

\displaystyle   p(x\,|\,\mu, \sigma) \propto \prod_{i=1}^n \exp{\bigg( -\frac{(x_i - \mu)^2}{2\sigma^2}\bigg)}

The assumption that \sigma is known is unlikely but the point of this post is to demonstrate MCMC matching an analytic formula.

This gives a posterior of

\displaystyle   \begin{aligned}  p(\mu\,|\, \boldsymbol{x}) &\propto \exp{\bigg(  -\frac{(\mu - \mu_0)^2}{2\sigma_0^2}  - \frac{\sum_{i=1}^n(x_i - \mu)^2}{2\sigma^2}\bigg)} \\  &\propto \exp{\bigg[-\frac{1}{2}\bigg(\frac{\mu^2 \sigma^2 -2\sigma^2\mu\mu_0 - 2\sigma_0^2n\bar{x}\mu + \sigma_0^2 n\mu^2}{\sigma^2\sigma_0^2}\bigg)\bigg]} \\  &= \exp{\bigg[-\frac{1}{2}\bigg(\frac{ (n\sigma_0^2 + \sigma^2)\mu^2 - 2(\sigma^2\mu_0 + \sigma_0^2n\bar{x})\mu}{\sigma^2\sigma_0^2}\bigg)\bigg]} \\  &= \exp{\Bigg[-\frac{1}{2}\Bigg(\frac{ \mu^2 - 2\mu\frac{(\sigma^2\mu_0 + \sigma_0^2n\bar{x})}{(n\sigma_0^2 + \sigma^2)}}{\frac{\sigma^2\sigma_0^2}{(n\sigma_0^2 + \sigma^2)}}\Bigg)\Bigg]} \\  &\propto \exp{\Bigg[-\frac{1}{2}\Bigg(\frac{\big(\mu - \frac{(\sigma^2\mu_0 + \sigma_0^2n\bar{x})}{(n\sigma_0^2 + \sigma^2)}\big)^2}{\frac{\sigma^2\sigma_0^2}{(n\sigma_0^2 + \sigma^2)}}\Bigg)\Bigg]}  \end{aligned}

In other words

\displaystyle   \mu\,|\, \boldsymbol{x} \sim {\cal{N}}\bigg(\frac{\sigma^2\mu_0 + n\sigma_0^2\bar{x}}{n\sigma_0^2 + \sigma^2}, \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2} \bigg)

Writing

\displaystyle   \sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2}

we get

\displaystyle   \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}

Thus the precision (the inverse of the variance) of the posterior is the precision of the prior plus the precision of the data scaled by the number of observations. This gives a nice illustration of how Bayesian statistics improves our beliefs.

Writing

\displaystyle   \mu_n = \frac{\sigma^2\mu_0 + n\sigma_0^2\bar{x}}{n\sigma_0^2 + \sigma^2}

and

\displaystyle   \lambda = 1 / \sigma^2, \, \lambda_0 = 1 / \sigma_0^2, \, \lambda_n = 1 / \sigma_n^2

we see that

\displaystyle   \mu_n = \frac{n\bar{x}\lambda + \mu_0\lambda_0}{\lambda_n}

Thus the mean of the posterior is a weighted sum of the prior mean and the sample mean, weighted respectively by the precision of the prior and by the precision of the data scaled by the number of observations.

Rather arbitrarily let us pick a prior mean of

> mu0 :: Double
> mu0 = 11.0

and express our uncertainty about it with a largish prior variance

> sigma_0 :: Double
> sigma_0 = 2.0

And also arbitrarily let us pick the known variance for the samples as

> sigma :: Double
> sigma = 1.0

We can sample from this in a way that looks very similar to STAN and JAGS:

> hierarchicalSample :: MonadRandom m => m Double
> hierarchicalSample = do
>   mu <- sample (Normal mu0 sigma_0)
>   x  <- sample (Normal mu sigma)
>   return x

and we didn’t need to write a new language for this.

Again arbitrarily let us take

> nSamples :: Int
> nSamples = 10

and use

> arbSeed :: Int
> arbSeed = 2

And then actually generate the samples.

> simpleXs :: [Double]
> simpleXs =
>   evalState (replicateM nSamples hierarchicalSample)
>             (pureMT $ fromIntegral arbSeed)

Using the formulae we derived above we can calculate the posterior

> mu_1, sigma1, simpleNumerator :: Double
> simpleNumerator = fromIntegral nSamples * sigma_0**2 + sigma**2
> mu_1 = (sigma**2 * mu0 + sigma_0**2 * sum simpleXs) / simpleNumerator
> sigma1 = sigma**2 * sigma_0**2 / simpleNumerator

and then compare it against the prior

The red posterior shows we are a lot more certain now that we have some evidence.
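
As a sanity check we can also express the posterior in the precision form derived above. The names below are ad hoc additions (not used elsewhere); muN should agree with mu_1 and recip lambdaN with sigma1.

> xbar, lambda0, lambdaData, lambdaN, muN :: Double
> xbar       = sum simpleXs / fromIntegral nSamples
> lambda0    = recip (sigma_0**2)
> lambdaData = recip (sigma**2)
> lambdaN    = fromIntegral nSamples * lambdaData + lambda0
> muN        = (fromIntegral nSamples * xbar * lambdaData + mu0 * lambda0) / lambdaN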

Via Markov Chain Monte Carlo

The theory behind MCMC is described in a previous post. We need to generate some proposed steps for the chain. We sample from the normal distribution but we could have used e.g. the gamma.

> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
>   evalState (replicateM nIters (sample (Normal 0.0 sigma)))
>   (pureMT $ fromIntegral seed)

We also need samples from the uniform distribution

> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
>   evalState (replicateM nIters (sample stdUniform))
>   (pureMT $ fromIntegral seed)

And now we can calculate the (un-normalised) prior, likelihood and posterior

> prior :: Double -> Double
> prior mu = exp (-(mu - mu0)**2 / (2 * sigma_0**2))
> 
> likelihood :: Double -> [Double] -> Double
> likelihood mu xs = exp (-sum (map (\x -> (x - mu)**2 / (2 * sigma**2)) xs))
> 
> posterior :: Double -> [Double] -> Double
> posterior mu xs = likelihood mu xs * prior mu

The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.

> acceptanceProb :: Double -> Double -> [Double] -> Double
> acceptanceProb mu mu' xs = min 1.0 ((posterior mu' xs) / (posterior mu xs))
> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (mu, nAccs) (proposedJump, acceptOrReject) =
>   if acceptOrReject < acceptanceProb mu (mu + proposedJump) simpleXs
>   then (mu + proposedJump, nAccs + 1)
>   else (mu, nAccs)

Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning.

> nIters, burnIn :: Int
> nIters = 300000
> burnIn = nIters `div` 10

Let us start our chain at

> startMu :: Double
> startMu = 10.0

and set the variance of the jumps to

> jumpVar :: Double
> jumpVar = 0.4
> test :: Int -> [(Double, Int)]
> test seed =
>   drop burnIn $
>   scanl oneStep (startMu, 0) $
>   zip (normalisedProposals seed jumpVar nIters)
>       (acceptOrRejects seed nIters)
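
Since oneStep also counts acceptances, the overall acceptance rate can be read off the end of the chain (a small sketch; the name acceptanceRate is ad hoc and the count includes the burn-in period):

> acceptanceRate :: Int -> Double
> acceptanceRate seed =
>   fromIntegral (snd (last (test seed))) / fromIntegral nIters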

We put the data into a histogram

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = startMu - 1.5*sigma_0
>     upper = startMu + 1.5*sigma_0
> 
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $ test seed)

Not bad but a bit lumpy. Let’s try a few runs and see if we can smooth things out.

> hists :: [Histogram V.Vector BinD Double]
> hists = parMap rpar hist [3,4..102]
> emptyHist :: Histogram V.Vector BinD Double
> emptyHist = fillBuilder hb (replicate numBins 0)
> 
> smoothHist :: Histogram V.Vector BinD Double
> smoothHist = foldl' (H.zip (+)) emptyHist hists

Quite nice and had my machine running at 750% with +RTS -N8.

Comparison

Let’s create the same histogram but from the posterior created analytically.

> analPosterior :: [Double]
> analPosterior =
>   evalState (replicateM 100000 (sample (Normal mu_1 (sqrt sigma1))))
>   (pureMT $ fromIntegral 5)
> 
> histAnal :: Histogram V.Vector BinD Double
> histAnal = fillBuilder hb analPosterior

And then compare them. Because they overlap so well, we show the MCMC result, the two together, and the analytic result on separate charts.

PostAmble

Normally with BlogLiteratelyD, we can generate diagrams on the fly. However, here we want to run the simulations in parallel so we need to actually compile something.

ghc -O2 ConjMCMCSimple.lhs -main-is ConjMCMCSimple -threaded -fforce-recomp
> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
>   mainRender ( DiagramOpts (Just 900) (Just 700) fn
>              , DiagramLoopOpts False Nothing 0
>              )
> main :: IO ()
> main = do
>   displayHeader "https://idontgetoutmuch.files.wordpress.com/2014/03/HistMCMC.png"
>     (barDiag MCMC
>      (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
> 
>   displayHeader "https://idontgetoutmuch.files.wordpress.com/2014/03/HistMCMCAnal.png"
>     (barDiag MCMCAnal
>      (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
> 
>   displayHeader "https://idontgetoutmuch.files.wordpress.com/2014/03/HistAnal.png"
>     (barDiag Anal
>      (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
> 
>   displayHeader "https://idontgetoutmuch.files.wordpress.com/2014/03/SmoothHistMCMC.png"
>     (barDiag MCMC
>      (zip (map fst $ asList smoothHist) (map snd $ asList smoothHist))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))