Ecology, Dynamical Systems and Inference via PMMH

Introduction

In the 1920s, Lotka (1909) and Volterra (1926) developed a model of a very simple predator-prey ecosystem.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1  - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & c_2 N_1 N_2 - \rho_2 N_2  \end{aligned}

Although simple, it turns out that the Canadian lynx and snowshoe hare are well represented by such a model. Furthermore, the Hudson's Bay Company kept records of how many pelts of each species were trapped for almost a century, giving a good proxy for the population of each species.

We can capture the fact that we do not have a complete model by describing our state of ignorance about the parameters. In order to keep this as simple as possible let us assume that log parameters undergo Brownian motion. That is, we know the parameters will jiggle around and the further into the future we look the less certain we are about what values they will have taken. By making the log parameters undergo Brownian motion, we can also capture our modelling assumption that birth, death and predation rates are always positive. A similar approach is taken in Dureau, Kalogeropoulos, and Baguelin (2013) where the (log) parameters of an epidemiological model are taken to be Ornstein-Uhlenbeck processes (which is biologically more plausible although adds to the complexity of the model, something we wish to avoid in an example such as this).

Andrieu, Doucet, and Holenstein (2010) propose a method to estimate the parameters of such models (Particle Marginal Metropolis Hastings aka PMMH) and the domain specific probabilistic language LibBi (Murray (n.d.)) can be used to apply this (and other inference methods).

For the sake of simplicity, in this blog post, we only model one parameter as being unknown and undergoing Brownian motion. A future blog post will consider more sophisticated scenarios.

A Dynamical System Aside

The above dynamical system is structurally unstable (more on this in a future post), a possible indication that it should not be considered as a good model of predator–prey interaction. Let us modify this to include carrying capacities for the populations of both species.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2  \end{aligned}
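Setting both derivatives to zero (and assuming neither population is extinct), the interior fixed point must satisfy the linear system

\displaystyle   \begin{aligned}  \frac{\rho_1}{K_1} N_1 + c_1 N_2 & = & \rho_1 \\  c_2 N_1 - \frac{\rho_2}{K_2} N_2 & = & \rho_2  \end{aligned}

which is what we solve numerically below.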

Data Generation with LibBi

Let’s generate some data using LibBi.

// Generate data assuming a fixed growth rate for hares rather than
// e.g. a growth rate that undergoes Brownian motion.

model PP {
  const h         = 0.1;    // time step
  const delta_abs = 1.0e-3; // absolute error tolerance
  const delta_rel = 1.0e-6; // relative error tolerance

  const a  = 5.0e-1 // Hare growth rate
  const k1 = 2.0e2  // Hare carrying capacity
  const b  = 2.0e-2 // Hare death rate per lynx
  const d  = 4.0e-1 // Lynx death rate
  const k2 = 2.0e1  // Lynx carrying capacity
  const c  = 4.0e-3 // Lynx birth rate per hare

  state P, Z       // Hares and lynxes
  state ln_alpha   // Hare growth rate - we express it in log form for
                   // consistency with the inference model
  obs P_obs        // Observations of hares

  sub initial {
    P ~ log_normal(log(100.0), 0.2)
    Z ~ log_normal(log(50.0), 0.1)
  }

  sub transition(delta = h) {
    ode(h = h, atoler = delta_abs, rtoler = delta_rel, alg = 'RK4(3)') {
      dP/dt =  a * P * (1 - P / k1) - b * P * Z
      dZ/dt = -d * Z * (1 + Z / k2) + c * P * Z
    }
  }

  sub observation {
    P_obs ~ log_normal(log(P), 0.1)
  }
}

We can look at phase space starting with different populations and see they all converge to the same fixed point.

Data Generation with Haskell

Since at some point in the future, I plan to produce Haskell versions of the methods given in Andrieu, Doucet, and Holenstein (2010), let’s generate the data using Haskell.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> module LotkaVolterra (
>     solLv
>   , solPp
>   , h0
>   , l0
>   , baz
>   , logBM
>   , eulerEx
>   ) where
> import Numeric.GSL.ODE
> import Numeric.LinearAlgebra
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import Control.Monad.State

Here’s the unstable model.

> lvOde :: Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          [Double] ->
>          [Double]
> lvOde rho1 c1 rho2 c2 _t [h, l] =
>   [
>     rho1 * h - c1 * h * l
>   , c2 * h * l - rho2 * l
>   ]
> lvOde _rho1 _c1 _rho2 _c2 _t vars =
>   error $ "lvOde called with: " ++ show (length vars) ++ " variables"
> rho1, c1, rho2, c2 :: Double
> rho1 = 0.5
> c1 = 0.02
> rho2 = 0.4
> c2 = 0.004
> deltaT :: Double
> deltaT = 0.1
> solLv :: Matrix Double
> solLv = odeSolve (lvOde rho1 c1 rho2 c2)
>                  [50.0, 50.0]
>                  (fromList [0.0, deltaT .. 50])

And here’s the stable model.

> ppOde :: Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          [Double] ->
>          [Double]
> ppOde a k1 b d k2 c _t [p, z] =
>   [
>     a * p * (1 - p / k1) - b * p * z
>   , -d * z * (1 + z / k2) + c * p * z
>   ]
> ppOde _a _k1 _b _d _k2 _c _t vars =
>   error $ "ppOde called with: " ++ show (length vars) ++ " variables"
> a, k1, b, d, k2, c :: Double
> a = 0.5
> k1 = 200.0
> b = 0.02
> d = 0.4
> k2 = 50.0
> c = 0.004
> solPp :: Double -> Double -> Matrix Double
> solPp x y = odeSolve (ppOde a k1 b d k2 c)
>                  [x, y]
>                  (fromList [0.0, deltaT .. 50])
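As a quick sanity check, solutions started from different populations should end up close to each other and to the same fixed point. Here lastRowPp is a small helper introduced purely for illustration (it is not part of the development), assuming the module above is loaded into ghci.

lastRowPp :: Double -> Double -> [Double]
lastRowPp x y = last (toLists (solPp x y))

Evaluating lastRowPp 50 50 and lastRowPp 150 25 should give similar values, both close to the fixed point computed below.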
> gamma, alpha, beta :: Double
> gamma = d / a
> alpha = a / (c * k1)
> beta  = d / (a * k2)
> fp :: (Double, Double)
> fp = ((gamma + beta) / (1 + alpha * beta), (1 - gamma * alpha) / (1 + alpha * beta))
> h0, l0 :: Double
> h0 = a * fst fp / c
> l0 = a * snd fp / b
> foo, bar :: Matrix R
> foo = matrix 2 [a / k1, b, c, negate d / k2]
> bar = matrix 1 [a, d]
> baz :: Maybe (Matrix R)
> baz = linearSolve foo bar

This gives a stable fixed point of

ghci> baz
  Just (2><1)
   [ 120.00000000000001
   ,               10.0 ]

Here’s an example of convergence to that fixed point in phase space.

The Stochastic Model

Let us now assume that the Hare growth parameter undergoes Brownian motion so that the further into the future we go, the less certain we are about it. In order to ensure that this parameter remains positive, let’s model its logarithm as Brownian motion.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2 \\  \mathrm{d} \rho_1 & = & \rho_1 \sigma_{\rho_1} \mathrm{d}W_t  \end{aligned}

where the final equation is a stochastic differential equation with W_t being a Wiener process.

By Itô we have

\displaystyle   \mathrm{d} (\log{\rho_1}) = - \frac{\sigma_{\rho_1}^2}{2} \mathrm{d} t + \sigma_{\rho_1} \mathrm{d}W_t

We can use this to generate paths for \rho_1.

\displaystyle   \rho_1(t + \Delta t) = \rho_1(t)\exp{\bigg(- \frac{\sigma_{\rho_1}^2}{2} \Delta t + \sigma_{\rho_1} \sqrt{\Delta t} Z\bigg)}

where Z \sim {\mathcal{N}}(0,1).

> oneStepLogBM :: MonadRandom m => Double -> Double -> Double -> m Double
> oneStepLogBM deltaT sigma rhoPrev = do
>   x <- sample $ rvar StdNormal
>   return $ rhoPrev * exp(sigma * (sqrt deltaT) * x - 0.5 * sigma * sigma * deltaT)
> iterateM :: Monad m => (a -> m a) -> m a -> Int -> m [a]
> iterateM f mx n = sequence . take n . iterate (>>= f) $ mx
> logBMM :: MonadRandom m => Double -> Double -> Int -> Int -> m [Double]
> logBMM initRho sigma n m =
>   iterateM (oneStepLogBM (recip $ fromIntegral n) sigma) (return initRho) (n * m)
> logBM :: Double -> Double -> Int -> Int -> Int -> [Double]
> logBM initRho sigma n m seed =
>   evalState (logBMM initRho sigma n m) (pureMT $ fromIntegral seed)
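For example (a sketch with illustrative arguments), we can generate a path at 500 steps per unit of time over 2 units of time. Since the path includes the initial value, it has n * m elements and starts at initRho:

ghci> let path = logBM 0.5 0.1 500 2 42
ghci> length path
  1000

ghci> head path
  0.5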

We can see the further we go into the future the less certain we are about the value of the parameter.

Using this we can simulate the whole dynamical system which is now a stochastic process.

> f1, f2 :: Double -> Double -> Double ->
>           Double -> Double ->
>           Double
> f1 a k1 b p z = a * p * (1 - p / k1) - b * p * z
> f2 d k2 c p z = -d * z * (1 + z / k2) + c * p * z
> oneStepEuler :: MonadRandom m =>
>                 Double ->
>                 Double ->
>                 Double -> Double ->
>                 Double -> Double -> Double ->
>                 (Double, Double, Double) ->
>                 m (Double, Double, Double)
> oneStepEuler deltaT sigma k1 b d k2 c (rho1Prev, pPrev, zPrev) = do
>     let pNew = pPrev + deltaT * f1 (exp rho1Prev) k1 b pPrev zPrev
>     let zNew = zPrev + deltaT * f2 d k2 c pPrev zPrev
>     rho1New <- oneStepLogBM deltaT sigma rho1Prev
>     return (rho1New, pNew, zNew)
> euler :: MonadRandom m =>
>          (Double, Double, Double) ->
>          Double ->
>          Double -> Double ->
>          Double -> Double -> Double ->
>          Int -> Int ->
>          m [(Double, Double, Double)]
> euler stateInit sigma k1 b d k2 c n m =
>   iterateM (oneStepEuler (recip $ fromIntegral n) sigma k1 b d k2 c)
>            (return stateInit)
>            (n * m)
> eulerEx :: (Double, Double, Double) ->
>            Double -> Int -> Int -> Int ->
>            [(Double, Double, Double)]
> eulerEx stateInit sigma n m seed =
>   evalState (euler stateInit sigma k1 b d k2 c n m) (pureMT $ fromIntegral seed)
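For example (a sketch with illustrative arguments), we can simulate 10 units of time at 500 steps per unit, starting at the fixed point computed earlier with the log hare growth rate at log 0.5:

ghci> let sim = eulerEx (log 0.5, 120.0, 10.0) 0.1 500 10 2
ghci> length sim
  5000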

We see that the populations become noisier the further into the future we go.

Notice that the second order effects of the system are now to some extent captured by the fact that the growth rate of Hares can drift. In our simulation, this is demonstrated by our increasing uncertainty about its value the further we look into the future.

Inference

Now let us infer the growth rate using PMMH. Here’s the model expressed in LibBi.

// Infer growth rate for hares

model PP {
  const h         = 0.1;    // time step
  const delta_abs = 1.0e-3; // absolute error tolerance
  const delta_rel = 1.0e-6; // relative error tolerance

  const a  = 5.0e-1 // Hare growth rate - superfluous for inference
                    // but a reminder of what we should expect
  const k1 = 2.0e2  // Hare carrying capacity
  const b  = 2.0e-2 // Hare death rate per lynx
  const d  = 4.0e-1 // Lynx death rate
  const k2 = 2.0e1  // Lynx carrying capacity
  const c  = 4.0e-3 // Lynx birth rate per hare

  state P, Z       // Hares and lynxes
  state ln_alpha   // Hare growth rate - we express it in log form for
                   // consistency with the inference model
  obs P_obs        // Observations of hares
  param mu, sigma  // Mean and standard deviation of hare growth rate
  noise w          // Noise

  sub parameter {
    mu ~ uniform(0.0, 1.0)
    sigma ~ uniform(0.0, 0.5)
  }

  sub proposal_parameter {
     mu ~ truncated_gaussian(mu, 0.02, 0.0, 1.0);
     sigma ~ truncated_gaussian(sigma, 0.01, 0.0, 0.5);
   }

  sub initial {
    P ~ log_normal(log(100.0), 0.2)
    Z ~ log_normal(log(50.0), 0.1)
    ln_alpha ~ gaussian(log(mu), sigma)
  }

  sub transition(delta = h) {
    w ~ normal(0.0, sqrt(h));
    ode(h = h, atoler = delta_abs, rtoler = delta_rel, alg = 'RK4(3)') {
      dP/dt =  exp(ln_alpha) * P * (1 - P / k1) - b * P * Z
      dZ/dt = -d * Z * (1 + Z / k2) + c * P * Z
      dln_alpha/dt = -sigma * sigma / 2 - sigma * w / h
    }
  }

  sub observation {
    P_obs ~ log_normal(log(P), 0.1)
  }
}
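As far as I can tell, the transition block encodes the Euler–Maruyama step for ln_alpha as an ODE: since w \sim {\mathcal{N}}(0, \sqrt{h}), integrating the right hand side over a step of length h gives

\displaystyle   \int_t^{t+h} \bigg(-\frac{\sigma^2}{2} - \frac{\sigma w}{h}\bigg) \,\mathrm{d}s = -\frac{\sigma^2}{2} h - \sigma w

which, recalling that w has standard deviation \sqrt{h}, has the same distribution as the exact update for \log \rho_1 derived earlier.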

Let’s look at the posteriors of the hyper-parameters for the Hare growth parameter.

The estimate for \mu is pretty decent. For our generated data, \sigma = 0, and given that our observations are quite noisy, the estimate for this is perhaps not too bad either.

Appendix: The R Driving Code

All code including the R below can be downloaded from github but make sure you use the straight-libbi branch and not master.

install.packages("devtools")
library(devtools)
install_github("sbfnk/RBi",ref="master")
install_github("sbfnk/RBi.helpers",ref="master")

rm(list = ls(all.names=TRUE))
unlink(".RData")

library('RBi')
try(detach(package:RBi, unload = TRUE), silent = TRUE)
library(RBi, quietly = TRUE)

library('RBi.helpers')

library('ggplot2', quietly = TRUE)
library('gridExtra', quietly = TRUE)

endTime <- 50

PP <- bi_model("PP.bi")
synthetic_dataset_PP <- bi_generate_dataset(endtime=endTime,
                                            model=PP,
                                            seed="42",
                                            verbose=TRUE,
                                            add_options = list(
                                                noutputs=500))

rdata_PP <- bi_read(synthetic_dataset_PP)

df <- data.frame(rdata_PP$P$nr,
                 rdata_PP$P$value,
                 rdata_PP$Z$value,
                 rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = Population, color = variable), size = 0.1) +
    geom_line(aes(y = rdata_PP$P$value, col = "Hare"), size = 0.1) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Lynx"), size = 0.1) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Observations"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Example Data") +
    xlab("Days") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/LVdata.png",width=4,height=3)

synthetic_dataset_PP1 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 100, Z=50),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP1 <- bi_read(synthetic_dataset_PP1)

synthetic_dataset_PP2 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 150, Z=25),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP2 <- bi_read(synthetic_dataset_PP2)

df1 <- data.frame(hare = rdata_PP$P$value,
                  lynx = rdata_PP$Z$value,
                  hare1 = rdata_PP1$P$value,
                  lynx1 = rdata_PP1$Z$value,
                  hare2 = rdata_PP2$P$value,
                  lynx2 = rdata_PP2$Z$value)

ggplot(df1) +
    geom_path(aes(x=df1$hare,  y=df1$lynx, col = "0"), size = 0.1) +
    geom_path(aes(x=df1$hare1, y=df1$lynx1, col = "1"), size = 0.1) +
    geom_path(aes(x=df1$hare2, y=df1$lynx2, col = "2"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Phase Space") +
    xlab("Hare") +
    ylab("Lynx") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/PPviaLibBi.png",width=4,height=3)

PPInfer <- bi_model("PPInfer.bi")

bi_object_PP <- libbi(client="sample", model=PPInfer, obs = synthetic_dataset_PP)

bi_object_PP$run(add_options = list(
                     "end-time" = endTime,
                     noutputs = endTime,
                     nsamples = 4000,
                     nparticles = 128,
                     seed=42,
                     nthreads = 1),
                 ## verbose = TRUE,
                 stdoutput_file_name = tempfile(pattern="pmmhoutput", fileext=".txt"))

bi_file_summary(bi_object_PP$result$output_file_name)

mu <- bi_read(bi_object_PP, "mu")$value
g1 <- qplot(x = mu[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(mu))
sigma <- bi_read(bi_object_PP, "sigma")$value
g2 <- qplot(x = sigma[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(sigma))
g3 <- grid.arrange(g1, g2)
ggsave(plot=g3,filename="diagrams/LvPosterior.png",width=4,height=3)


df2 <- data.frame(hareActs = rdata_PP$P$value,
                  hareObs  = rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = value, color = variable)) +
    geom_line(aes(y = rdata_PP$P$value, col = "Hare")) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Lynx")) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Hare Obs"))

ln_alpha <- bi_read(bi_object_PP, "ln_alpha")$value

P <- matrix(bi_read(bi_object_PP, "P")$value,nrow=51,byrow=TRUE)
Z <- matrix(bi_read(bi_object_PP, "Z")$value,nrow=51,byrow=TRUE)

data50 <- bi_generate_dataset(endtime=endTime,
                              model=PP,
                              seed="42",
                              verbose=TRUE,
                              add_options = list(
                                  noutputs=50))

rdata50 <- bi_read(data50)

df3 <- data.frame(days = c(1:51), hares = rowMeans(P), lynxes = rowMeans(Z),
                                  actHs = rdata50$P$value, actLs = rdata50$Z$value)


ggplot(df3) +
    geom_line(aes(x = days, y = hares, col = "Est Hare")) +
    geom_line(aes(x = days, y = lynxes, col = "Est Lynx")) +
    geom_line(aes(x = days, y = actHs, col = "Act Hare")) +
    geom_line(aes(x = days, y = actLs, col = "Act Lynx"))

Bibliography

Andrieu, Christophe, Arnaud Doucet, and Roman Holenstein. 2010. “Particle Markov chain Monte Carlo methods.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 72 (3): 269–342. doi:10.1111/j.1467-9868.2009.00736.x.

Dureau, Joseph, Konstantinos Kalogeropoulos, and Marc Baguelin. 2013. “Capturing the time-varying drivers of an epidemic using stochastic dynamical systems.” Biostatistics (Oxford, England) 14 (3): 541–55. doi:10.1093/biostatistics/kxs052.

Lotka, Alfred J. 1909. “Contribution to the Theory of Periodic Reactions.” The Journal of Physical Chemistry 14 (3): 271–74. doi:10.1021/j150111a004.

Murray, Lawrence M. n.d. “Bayesian State-Space Modelling on High-Performance Hardware Using LibBi.”

Volterra, Vito. 1926. “Variazioni e fluttuazioni del numero d’individui in specie animali conviventi.” Memorie Della R. Accademia Dei Lincei 6 (2): 31–113. http://www.liberliber.it/biblioteca/v/volterra/variazioni_e_fluttuazioni/pdf/volterra_variazioni_e_fluttuazioni.pdf.

Fun with LibBi and Influenza

Introduction

This is a bit different from my usual posts (well, apart from my write-up of hacking at Odessa) in that it is a log of how I managed to get LibBi (Library for Bayesian Inference) to run on my MacBook, though not totally satisfactorily (as you will see if you read on).

The intention is to try a few more approaches to the same problem, for example, Stan, monad-bayes and hand-crafted.

Kermack and McKendrick (1927) give a simple model of the spread of an infectious disease. Individuals move from being susceptible (S) to infected (I) to recovered (R).

\displaystyle   \begin{aligned}  \frac{dS}{dt} & = & -\delta S(t) I(t) \\  \frac{dI}{dt} & = & \delta S(t) I(t) - \gamma I(t) \\  \frac{dR}{dt} & = & \gamma I(t)  \end{aligned}

In 1978, anonymous authors sent a note to the British Medical Journal reporting an influenza outbreak in a boarding school in the north of England (“Influenza in a boarding school” 1978). The chart below shows the solution of the SIR (Susceptible, Infected, Recovered) model with parameters which give roughly the results observed in the school.
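As an aside, here is a sketch of how the system can be solved with the same hmatrix machinery used in the earlier posts. The parameter values and the initial state (762 susceptible pupils and 1 infected, the usual reading of the outbreak) are illustrative guesses, not the values used to produce the chart.

import Numeric.GSL.ODE
import Numeric.LinearAlgebra

-- dS/dt = -delta S I, dI/dt = delta S I - gamma I, dR/dt = gamma I
sirOde :: Double -> Double -> Double -> [Double] -> [Double]
sirOde delta gamma _t [s, i, _r] =
  [ -delta * s * i
  , delta * s * i - gamma * i
  , gamma * i
  ]
sirOde _ _ _ vars =
  error $ "sirOde called with: " ++ show (length vars) ++ " variables"

solSir :: Matrix Double
solSir = odeSolve (sirOde 2.0e-3 5.0e-1)
                  [762.0, 1.0, 0.0]
                  (fromList [0.0, 0.1 .. 14.0])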

LibBi

Step 1

~/LibBi-stable/SIR-master $ ./init.sh
error: 'ncread' undefined near line 6 column 7

The README says this is optional so we can skip over it. Still, it would be nice to fit the bridge weight function as described in Moral and Murray (2015).

The README does say that GPML is required but since we don’t (yet) need to do this step, let’s move on.

~/LibBi-stable/SIR-master $ ./run.sh
./run.sh

Error: ./configure failed with return code 77. See
.SIR/build_openmp_cuda_single/configure.log and
.SIR/build_openmp_cuda_single/config.log for details

It seems the example is configured to run on CUDA and it is highly likely that my installation of LibBi was not set up to allow this. We can change config.conf from

--disable-assert
--enable-single
--enable-cuda
--nthreads 2

to

--nthreads 4
--enable-sse
--disable-assert

On to the next issue.

~/LibBi-stable/SIR-master $ ./run.sh
./run.sh
Error: ./configure failed with return code 1. required QRUpdate
library not found. See .SIR/build_sse/configure.log and
.SIR/build_sse/config.log for details

But QRUpdate is installed!

~/LibBi-stable/SIR-master $ brew info QRUpdate
brew info QRUpdate
homebrew/science/qrupdate: stable 1.1.2 (bottled)
http://sourceforge.net/projects/qrupdate/
/usr/local/Cellar/qrupdate/1.1.2 (3 files, 302.6K)
/usr/local/Cellar/qrupdate/1.1.2_2 (6 files, 336.3K)
  Poured from bottle
/usr/local/Cellar/qrupdate/1.1.2_3 (6 files, 337.3K) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew-science/blob/master/qrupdate.rb
==> Dependencies
Required: veclibfort ✔
Optional: openblas ✔
==> Options
--with-openblas
	Build with openblas support
--without-check
	Skip build-time tests (not recommended)

Let’s look in the log as advised. So it seems that a certain symbol cannot be found.

checking for dch1dn_ in -lqrupdate

Let’s try ourselves.

nm -g /usr/local/Cellar/qrupdate/1.1.2_3/lib/libqrupdate.a | grep dch1dn_
0000000000000000 T _dch1dn_

So the symbol is there! What gives? Let’s try setting one of the environment variables.

export LDFLAGS='-L/usr/local/lib'

Now we get further.

./run.sh
Error: ./configure failed with return code 1. required NetCDF header
not found. See .SIR/build_sse/configure.log and
.SIR/build_sse/config.log for details

So we just need to set another environment variable.

export CPPFLAGS='-I/usr/local/include/'

This is more mysterious.

./run.sh
Error: ./configure failed with return code 1. required Boost header
not found. See .SIR/build_sse/configure.log and
.SIR/build_sse/config.log for details ~/LibBi-stable/SIR-master

Let’s see what we have.

brew list | grep -i boost

Nothing! I recall having some problems with boost when trying to use a completely different package. So let’s install boost.

brew install boost

Now we get a different error.

./run.sh
Error: make failed with return code 2, see .SIR/build_sse/make.log for details

Fortunately at some time in the past sbfnk took pity on me and advised me here to use boost155, a step that should not be lightly undertaken.

/usr/local/Cellar/boost155/1.55.0_1: 10,036 files, 451.6M, built in 15 minutes 9 seconds

Even then I had to say

brew link --force boost155

Finally it runs.

./run.sh 2> out.txt

And produces a lot of output

wc -l out.txt
   49999 out.txt

ls -ltrh results/posterior.nc
   1.7G Apr 30 19:57 results/posterior.nc

Rather worryingly, all the lines in out.txt are of the form

1: -51.9191 -23.2045 nan beats -inf -inf -inf accept=0.5

nan beating -inf does not sound good.

Now we are in a position to analyse the results.

octave --path oct/ --eval "plot_and_print"
error: 'bi_plot_quantiles' undefined near line 23 column 5

I previously found an Octave package(?) called OctBi so let’s create an .octaverc file which adds this to the path. We’ll also need to load the netcdf package which we previously installed.

addpath ("../OctBi-stable/inst")
pkg load netcdf
~/LibBi-stable/SIR-master $ octave --path oct/ --eval "plot_and_print"
octave --path oct/ --eval "plot_and_print"
warning: division by zero
warning: called from
    mean at line 117 column 7
    read_hist_simulator at line 47 column 11
    bi_read_hist at line 85 column 12
    bi_hist at line 63 column 12
    plot_and_print at line 56 column 5
warning: division by zero
warning: division by zero
warning: division by zero
warning: division by zero
warning: division by zero
warning: print.m: fig2dev binary is not available.
Some output formats are not available.
warning: opengl_renderer: x/y/zdata should have the same dimensions. Not rendering.
[the warning above repeated 24 times in all]
sh: pdfcrop: command not found

I actually get a chart from this so some kind of success.

This does not look like the chart in Moral and Murray (2015); the fitted number of infected patients looks a lot smoother and the “rates” parameters also vary in a much smoother manner. For reasons I haven’t yet investigated, it looks like over-fitting. Here are the charts from the paper.

Bibliography

“Influenza in a boarding school.” 1978. British Medical Journal, March, 587.

Kermack, W. O., and A. G. McKendrick. 1927. “A Contribution to the Mathematical Theory of Epidemics.” Proceedings of the Royal Society of London Series A 115 (August): 700–721. doi:10.1098/rspa.1927.0118.

Moral, Pierre Del, and Lawrence M Murray. 2015. “Sequential Monte Carlo with Highly Informative Observations.”

Naive Particle Smoothing is Degenerate

Introduction

Let \{X_t\}_{t \geq 1} be a (hidden) Markov process. By hidden, we mean that we are not able to observe it.

\displaystyle  X_1 \sim \mu(\centerdot) \quad X_t \,|\, (X_{t-1} = x) \sim f(\centerdot \,|\, x)

And let \{Y_t\}_{t \geq 1} be an observable Markov process such that

\displaystyle  Y_t \,|\, (X_{t} = x) \sim g(\centerdot \,|\, x)

That is, the observations are conditionally independent given the state of the hidden process.

As an example let us take the one given in Särkkä (2013) where the movement of a car is given by Newton’s laws of motion and the acceleration is modelled as white noise.

\displaystyle  \begin{aligned} X_t &= AX_{t-1} + Q \epsilon_t \\ Y_t &= HX_t + R \eta_t \end{aligned}

Although we do not do so here, A, Q, H and R can be derived from the dynamics. For the purpose of this blog post, we note that they are given by

\displaystyle  A = \begin{bmatrix} 1 & 0 & \Delta t & 0        \\ 0 & 1 & 0        & \Delta t \\ 0 & 0 & 1        & 0        \\ 0 & 0 & 0        & 1 \end{bmatrix} , \quad Q = \begin{bmatrix} \frac{q_1^c \Delta t^3}{3} & 0                          & \frac{q_1^c \Delta t^2}{2} & 0                          \\ 0                          & \frac{q_2^c \Delta t^3}{3} & 0                          & \frac{q_2^c \Delta t^2}{2} \\ \frac{q_1^c \Delta t^2}{2} & 0                          & {q_1^c \Delta t}           & 0                          \\ 0                          & \frac{q_2^c \Delta t^2}{2} & 0                          & {q_2^c \Delta t} \end{bmatrix}

and

\displaystyle  H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} , \quad R = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}

We wish to determine the position and velocity of the car given noisy observations of the position. In general we need the distribution of the hidden path given the observable path. We use the notation x_{m:n} to mean the path of x starting at m and finishing at n.

\displaystyle  p(x_{1:n} \,|\, y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})}

Haskell Preamble

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE FlexibleInstances            #-}
> {-# LANGUAGE MultiParamTypeClasses        #-}
> {-# LANGUAGE FlexibleContexts             #-}
> {-# LANGUAGE TypeFamilies                 #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving   #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE TemplateHaskell              #-}
> module ParticleSmoothing
>   ( simpleSamples
>   , carSamples
>   , testCar
>   , testSimple
>   ) where
> import Data.Random.Source.PureMT
> import Data.Random hiding ( StdNormal, Normal )
> import qualified Data.Random as R
> import Control.Monad.State
> import Control.Monad.Writer hiding ( Any, All )
> import qualified Numeric.LinearAlgebra.HMatrix as H
> import Foreign.Storable ( Storable )
> import Data.Maybe ( fromJust )
> import Data.Bits ( shiftR )
> import qualified Data.Vector as V
> import qualified Data.Vector.Unboxed as U
> import Control.Monad.ST
> import System.Random.MWC
> import Data.Array.Repa ( Z(..), (:.)(..), Any(..), computeP,
>                          extent, DIM1, DIM2, slice, All(..)
>                        )
> import qualified Data.Array.Repa as Repa
> import qualified Control.Monad.Loops as ML
> import PrettyPrint ()
> import Text.PrettyPrint.HughesPJClass ( Pretty, pPrint )
> import Data.Vector.Unboxed.Deriving

Some Theory

If we could sample X_{1:n}^{(i)} \sim p(x_{1:n} \,|\, y_{1:n}) then we could approximate the posterior as

\displaystyle  \hat{p}(x_{1:n} \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n}^{(i)}}(x_{1:n})

If we wish to, we can create marginal estimates

\displaystyle  \hat{p}(x_k \,|\, y_{1:n}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{k}^{(i)}}(x_{k})

When k = n, this is the filtering estimate.

Standard Bayesian Recursion

Prediction

\displaystyle  \begin{aligned} p(x_n \,|\, y_{1:n-1}) &= \int p(x_{n-1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\  &= \int p(x_{n} \,|\, x_{n-1}, y_{1:n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\  &= \int f(x_{n} \,|\, x_{n-1}) \, p(x_{n-1} \,|\, y_{1:n-1}) \,\mathrm{d}x_{n-1} \\ \end{aligned}

Update

\displaystyle  \begin{aligned} p(x_n \,|\, y_{1:n}) &= \frac{p(y_n \,|\, x_n, y_{1:n-1}) \, p(x_n \,|\, y_{1:n-1})}                              {p(y_n \,|\, y_{1:n-1})} \\                      &= \frac{g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})}                              {p(y_n \,|\, y_{1:n-1})} \end{aligned}

where by definition

\displaystyle  {p(y_n \,|\, y_{1:n-1})} = \int {g(y_n \,|\, x_n) \, p(x_n \,|\, y_{1:n-1})} \,\mathrm{d}x_n

Path Space Recursion

We have

\displaystyle  \begin{aligned} p(x_{1:n} \,|\, y_{1:n}) &= \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})} \\ &= \frac{p(x_n, y_n \,|\, x_{1:n-1}, y_{1:n-1})}{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{p(y_n \,|\, x_{1:n}, y_{1:n-1}) \, p(x_n \,|\, x_{1:n-1}, y_{1:n-1}) }{p(y_{1:n})} \, p(x_{1:n-1}, y_{1:n-1}) \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, \frac{p(x_{1:n-1}, y_{1:n-1})}{ \, p(y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, f(x_n \,|\, x_{n-1})}{p(y_n \,|\, y_{1:n-1})} \, {p(x_{1:n-1} \,|\,y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \overbrace{f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}}^{\mathrm{predictive}\,p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})} \\ \end{aligned}

where by definition

\displaystyle  p(y_n \,|\, y_{1:n-1}) = \int g(y_n \,|\, x_n) \, p(x_{1:n} \,|\, y_{1:n-1}) \,\mathrm{d}x_{1:n}

Prediction

\displaystyle  p(x_{1:n} \,|\, y_{1:n-1}) = f(x_n \,|\, x_{n-1}) \, {p(x_{1:n-1} \,|\,y_{1:n-1})}

Update

\displaystyle  p(x_{1:n} \,|\, y_{1:n}) = \frac{g(y_n \,|\, x_{n}) \, {p(x_{1:n} \,|\, y_{1:n-1})}} {p(y_n \,|\, y_{1:n-1})}

Algorithm

The idea is to simulate paths using the recursion we derived above.

At time n-1 we have an approximating distribution

\displaystyle  \hat{p}(x_{1:n-1} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:n-1}^{(i)}}(x_{1:n-1})

Sample \tilde{X}_n^{(i)} \sim f(\centerdot \,|\, X_{n-1}^{(i)}) and set \tilde{X}_{1:n}^{(i)} = ({X}_{1:n-1}^{(i)}, \tilde{X}_n^{(i)}). We then have an approximation of the prediction step

\displaystyle  \hat{p}(x_{1:n} \,|\, y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}^{(i)}}(x_{1:n})

Substituting

\displaystyle  \begin{aligned} {\hat{p}(y_n \,|\, y_{1:n-1})} &= \int {g(y_n \,|\, x_n) \, \hat{p}(x_{1:n} \,|\, y_{1:n-1})} \,\mathrm{d}x_{1:n} \\ &= \int {g(y_n \,|\, x_n)}\frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}^{(i)}}(x_{1:n}) \,\mathrm{d}x_{1:n} \\ &= \frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})} \end{aligned}

and again

\displaystyle  \begin{aligned} \tilde{p}(x_{1:n} \,|\, y_{1:n}) &= \frac{g(y_n \,|\, x_{n}) \, {\hat{p}(x_{1:n} \,|\, y_{1:n-1})}}      {\hat{p}(y_n \,|\, y_{1:n-1})} \\ &= \frac{g(y_n \,|\, x_{n}) \, \frac{1}{N}\sum_{i=1}^N \delta_{\tilde{X}_{1:n}^{(i)}}(x_{1:n})}      {\frac{1}{N}\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \frac{ \sum_{i=1}^N g(y_n \,|\, \tilde{X}_n^{(i)}) \, \delta_{\tilde{X}_{1:n}^{(i)}}(x_{1:n})}      {\sum_{i=1}^N {g(y_n \,|\, \tilde{X}_n^{(i)})}} \\ &= \sum_{i=1}^N W_n^{(i)} \delta_{\tilde{X}_{1:n}^{(i)}} (x_{1:n}) \end{aligned}

where W_n^{(i)} \propto g(y_n \,|\, \tilde{X}_n^{(i)}) and \sum_{i=1}^N W_n^{(i)} = 1.

Now sample

\displaystyle  X_{1:n}^{(i)} \sim \tilde{p}(x_{1:n} \,|\, y_{1:n})

It is this resampling step that causes the degeneracy we demonstrate below: each resampling discards some paths entirely, so the surviving paths share fewer and fewer distinct ancestors the further back in time we look.

A Haskell Implementation

Let’s specify some values for the example of the car moving in two dimensions.

> deltaT, sigma1, sigma2, qc1, qc2 :: Double
> deltaT = 0.1
> sigma1 = 1/2
> sigma2 = 1/2
> qc1 = 1
> qc2 = 1
> bigA :: H.Matrix Double
> bigA = (4 H.>< 4) bigAl
> bigAl :: [Double]
> bigAl = [1, 0 , deltaT,      0,
>          0, 1,       0, deltaT,
>          0, 0,       1,      0,
>          0, 0,       0,      1]
> bigQ :: H.Herm Double
> bigQ = H.trustSym $ (4 H.>< 4) bigQl
> bigQl :: [Double]
> bigQl = [qc1 * deltaT^3 / 3,                  0, qc1 * deltaT^2 / 2,                  0,
>                           0, qc2 * deltaT^3 / 3,                  0, qc2 * deltaT^2 / 2,
>          qc1 * deltaT^2 / 2,                  0,       qc1 * deltaT,                  0,
>                           0, qc2 * deltaT^2 / 2,                  0,       qc2 * deltaT]
> bigH :: H.Matrix Double
> bigH = (2 H.>< 4) [1, 0, 0, 0,
>                    0, 1, 0, 0]
> bigR :: H.Herm Double
> bigR = H.trustSym $ (2 H.>< 2) [sigma1^2,        0,
>                                        0, sigma2^2]
> m0 :: H.Vector Double
> m0 = H.fromList [0, 0, 1, -1]
> bigP0 :: H.Herm Double
> bigP0 = H.trustSym $ H.ident 4
> n :: Int
> n = 23 -- the number of particles

With these we generate hidden and observable sample paths.

> carSample :: MonadRandom m =>
>              H.Vector Double ->
>              m (Maybe ((H.Vector Double, H.Vector Double), H.Vector Double))
> carSample xPrev = do
>   xNew <- sample $ rvar (Normal (bigA H.#> xPrev) bigQ)
>   yNew <- sample $ rvar (Normal (bigH H.#> xNew) bigR)
>   return $ Just ((xNew, yNew), xNew)
> carSamples :: [(H.Vector Double, H.Vector Double)]
> carSamples = evalState (ML.unfoldrM carSample m0) (pureMT 17)

We can plot an example trajectory for the car and the noisy observations that are available to the smoother / filter.

Sadly there is no equivalent to numpy in Haskell: random number packages generate vectors; for multi-rank arrays there is repa; and for fast matrix manipulation there is hmatrix. Thus for our single step path update function, we have to pass in functions to perform type conversion. Clearly with all the copying inherent in this approach, performance is not going to be great.

The type synonym ArraySmoothing is used to denote the cloud of particles at each time step.

> type ArraySmoothing = Repa.Array Repa.U DIM2
> singleStep :: forall a . U.Unbox a =>
>               (a -> H.Vector Double) ->
>               (H.Vector Double -> a) ->
>               H.Matrix Double ->
>               H.Herm Double ->
>               H.Matrix Double ->
>               H.Herm Double ->
>               ArraySmoothing a -> H.Vector Double ->
>               WriterT [ArraySmoothing a] (StateT PureMT IO) (ArraySmoothing a)
> singleStep f g bigA bigQ bigH bigR x y = do
>   tell[x]
>   let (Z :. ix :. jx) = extent x
> 
>   xHatR <- lift $ computeP $ Repa.slice x (Any :. jx - 1)
>   let xHatH = map f $ Repa.toList (xHatR  :: Repa.Array Repa.U DIM1 a)
>   xTildeNextH <- lift $ mapM (\x -> sample $ rvar (Normal (bigA H.#> x) bigQ)) xHatH
> 
>   let xTildeNextR = Repa.fromListUnboxed (Z :. ix :. (1 :: Int)) $
>                     map g xTildeNextH
>       xTilde = Repa.append x xTildeNextR
> 
>       weights = map (normalPdf y bigR) $
>                 map (bigH H.#>) xTildeNextH
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList weights)
>       totWeight = sum weights
>       js = indices (V.map (/ totWeight) $ V.tail cumSumWeights) vs
>       xNewV = V.map (\j -> Repa.transpose $
>                            Repa.reshape (Z :. (1 :: Int) :. jx + 1) $
>                            slice xTilde (Any :. j :. All)) js
>       xNewR = Repa.transpose $ V.foldr Repa.append (xNewV V.! 0) (V.tail xNewV)
>   computeP xNewR

The state for the car is a 4-tuple.

> data SystemState a = SystemState { xPos  :: a
>                                  , yPos  :: a
>                                  , _xSpd :: a
>                                  , _ySpd :: a
>                                  }

We initialize the smoother from some prior distribution.

> initCar :: StateT PureMT IO (ArraySmoothing (SystemState Double))
> initCar = do
>   xTilde1 <- replicateM n $ sample $ rvar (Normal m0 bigP0)
>   let weights = map (normalPdf (snd $ head carSamples) bigR) $
>                 map (bigH H.#>) xTilde1
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $
>               map ((\[a,b,c,d] -> SystemState a b c d) . H.toList) $
>               V.toList $
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1

Now we can run the smoother.

> smootherCar :: StateT PureMT IO
>             (ArraySmoothing (SystemState Double)
>             , [ArraySmoothing (SystemState Double)])
> smootherCar = runWriterT $ do
>   xHat1 <- lift initCar
>   foldM (singleStep f g bigA bigQ bigH bigR) xHat1 (take 100 $ map snd $ tail carSamples)
> f :: SystemState Double -> H.Vector Double
> f (SystemState a b c d) = H.fromList [a, b, c, d]
> g :: H.Vector Double -> SystemState Double
> g = (\[a,b,c,d] -> (SystemState a b c d)) . H.toList

And create inferred positions for the car which we then plot.

> testCar :: IO ([Double], [Double])
> testCar = do
>   states <- snd <$> evalStateT smootherCar (pureMT 24)
>   let xs :: [Repa.Array Repa.D DIM2 Double]
>       xs = map (Repa.map xPos) states
>   sumXs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose xs)
>   let ixs = map extent sumXs
>       sumLastXs = map (* (recip $ fromIntegral n)) $
>                   zipWith (Repa.!) sumXs (map (\(Z :. x) -> Z :. (x - 1)) ixs)
>   let ys :: [Repa.Array Repa.D DIM2 Double]
>       ys = map (Repa.map yPos) states
>   sumYs :: [Repa.Array Repa.U DIM1 Double] <- mapM Repa.sumP (map Repa.transpose ys)
>   let ixsY = map extent sumYs
>       sumLastYs = map (* (recip $ fromIntegral n)) $
>                   zipWith (Repa.!) sumYs (map (\(Z :. x) -> Z :. (x - 1)) ixsY)
>   return (sumLastXs, sumLastYs)

So it seems our smoother does quite well at inferring the state at the latest observation, that is, when it is working as a filter. But what about estimates for earlier times? We should do better as we have observations in the past and in the future. Let’s try with a simpler example and a smaller number of particles.

First we create some samples for our simple 1 dimensional linear Gaussian model.

> bigA1, bigQ1, bigR1, bigH1 :: Double
> bigA1 = 0.5
> bigQ1 = 0.1
> bigR1 = 0.1
> bigH1 = 1.0
> simpleSample :: MonadRandom m =>
>               Double ->
>               m (Maybe ((Double, Double), Double))
> simpleSample xPrev = do
>   xNew <- sample $ rvar (R.Normal (bigA1 * xPrev) bigQ1)
>   yNew <- sample $ rvar (R.Normal (bigH1 * xNew) bigR1)
>   return $ Just ((xNew, yNew), xNew)
> simpleSamples :: [(Double, Double)]
> simpleSamples = evalState (ML.unfoldrM simpleSample 0.0) (pureMT 17)

Again create a prior.

> initSimple :: MonadRandom m => m (ArraySmoothing Double)
> initSimple = do
>   let y = snd $ head simpleSamples
>   xTilde1 <- replicateM n $ sample $ rvar $ R.Normal y bigR1
>   let weights = map (pdf (R.Normal y bigR1)) $
>                 map (bigH1 *) xTilde1
>       totWeight = sum weights
>       vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
>       cumSumWeights = V.scanl (+) 0 (V.fromList $ map (/ totWeight) weights)
>       js = indices (V.tail cumSumWeights) vs
>       xHat1 = Repa.fromListUnboxed (Z :. n :. (1 :: Int)) $
>               V.toList $
>               V.map ((V.fromList xTilde1) V.!) js
>   return xHat1

Now we can run the smoother.

> smootherSimple :: StateT PureMT IO (ArraySmoothing Double, [ArraySmoothing Double])
> smootherSimple = runWriterT $ do
>   xHat1 <- lift initSimple
>   foldM (singleStep f1 g1 ((1 H.>< 1) [bigA1]) (H.trustSym $ (1 H.>< 1) [bigQ1^2])
>                           ((1 H.>< 1) [bigH1]) (H.trustSym $ (1 H.>< 1) [bigR1^2]))
>         xHat1
>         (take 20 $ map H.fromList $ map return . map snd $ tail simpleSamples)
> f1 :: Double -> H.Vector Double
> f1 a = H.fromList [a]
> g1 :: H.Vector Double -> Double
> g1 = (\[a] -> a) . H.toList

And finally we can look at the paths not just the means of the marginal distributions at the latest observation time.

> testSimple :: IO [[Double]]
> testSimple = do
>   states <- snd <$> evalStateT smootherSimple (pureMT 24)
>   let path :: Int -> IO (Repa.Array Repa.U DIM1 Double)
>       path i = computeP $ Repa.slice (last states) (Any :. i :. All)
>   paths <- mapM path [0..n - 1]
>   return $ map Repa.toList paths

We can see that at some point in the past all the current particles have one ancestor. The marginals of the smoothing distribution (at some point in the past) have collapsed on to one particle.

Notes

Helpers for the Inverse CDF

An explanation of why these are helpers for the inverse CDF is deferred to another blog post.

> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs
> binarySearch :: Ord a =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
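As a quick check (with values chosen purely for illustration), four equally weighted particles have cumulative weights 0.25, 0.5, 0.75 and 1.0, and uniform draws get mapped to particle indices as follows:

ghci> indices (V.fromList [0.25, 0.5, 0.75, 1.0]) (V.fromList [0.1, 0.6, 0.95])
  [0,2,3]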

Multivariate Normal

The random-fu package does not contain a sampler or pdf for a multivariate normal so we create our own. This should be added to the random-fu-multivariate package or something similar.

> normalMultivariate :: H.Vector Double -> H.Herm Double -> RVarT m (H.Vector Double)
> normalMultivariate mu bigSigma = do
>   z <- replicateM (H.size mu) (rvarT R.StdNormal)
>   return $ mu + bigA H.#> (H.fromList z)
>   where
>     (vals, bigU) = H.eigSH bigSigma
>     lSqrt = H.diag $ H.cmap sqrt vals
>     bigA = bigU H.<> lSqrt
> data family Normal k :: *
> data instance Normal (H.Vector Double) = Normal (H.Vector Double) (H.Herm Double)
> instance Distribution Normal (H.Vector Double) where
>   rvar (Normal m s) = normalMultivariate m s
> normalPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
>              H.Vector a -> H.Herm a -> H.Vector a -> a
> normalPdf mu sigma x = exp $ normalLogPdf mu sigma x
> normalLogPdf :: (H.Numeric a, H.Field a, H.Indexable (H.Vector a) a, Num (H.Vector a)) =>
>                  H.Vector a -> H.Herm a -> H.Vector a -> a
> normalLogPdf mu bigSigma x = - H.sumElements (H.cmap log (diagonals dec))
>                               - 0.5 * (fromIntegral (H.size mu)) * log (2 * pi)
>                               - 0.5 * s
>   where
>     dec = fromJust $ H.mbChol bigSigma
>     t = fromJust $ H.linearSolve (H.tr dec) (H.asColumn $ x - mu)
>     u = H.cmap (\x -> x * x) t
>     s = H.sumElements u
> diagonals :: (Storable a, H.Element t, H.Indexable (H.Vector t) a) =>
>              H.Matrix t -> H.Vector a
> diagonals m = H.fromList (map (\i -> m H.! i H.! i) [0..n-1])
>   where
>     n = max (H.rows m) (H.cols m)
> instance PDF Normal (H.Vector Double) where
>   pdf (Normal m s) = normalPdf m s
>   logPdf (Normal m s) = normalLogPdf m s

Miscellaneous

> derivingUnbox "SystemState"
>     [t| forall a . (U.Unbox a) => SystemState a -> (a, a, a, a) |]
>     [| \ (SystemState x y xdot ydot) -> (x, y, xdot, ydot) |]
>     [| \ (x, y, xdot, ydot) -> SystemState x y xdot ydot |]
> instance Pretty a => Pretty (SystemState a) where
>   pPrint (SystemState x y xdot ydot ) = pPrint (x, y, xdot, ydot)

Bibliography

Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. New York, NY, USA: Cambridge University Press.

Population Growth Estimation via Hamiltonian Monte Carlo

Here’s the same analysis of estimating population growth, this time using Stan.

data {
  int<lower=0> N; // number of observations
  vector[N] y;    // observed population
}

parameters {
  real r;
}

model {
  real k;
  real p0;
  real deltaT;
  real sigma;
  real mu0;
  real sigma0;
  vector[N] p;
  k      <- 1.0;
  p0     <- 0.1;
  deltaT <- 0.0005;
  sigma  <- 0.01;
  mu0    <- 5;
  sigma0 <- 10;

  r ~ normal(mu0, sigma0);

  for (n in 1:N) {
    p[n] <- k * p0 * exp((n - 1) * r * deltaT) / (k + p0 * (exp((n - 1) * r * deltaT) - 1));
    y[n] ~ normal(p[n], sigma);
  }
}
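The expression for p[n] is the analytic solution of the logistic equation evaluated at t = (n-1)\Delta T:

\displaystyle   p_n = \frac{k p_0 \exp((n-1) r \Delta T)}{k + p_0 (\exp((n-1) r \Delta T) - 1)}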

Empirically, by looking at the posterior, this seems to do a better job than either the extended Kalman filter or vanilla Metropolis.

The Flow of the Thames: An Autoregressive Model

Thames Flux

It is roughly 150 miles from the source of the Thames to Kingston Bridge. If we assume that it flows at about 2 miles per hour then the water at Thames Head will have reached Kingston in very roughly \frac{150}{24\times 2} \approxeq 3 days.

The Environment Agency measures the flux at Kingston Bridge on a twice-daily basis. Can we predict this? In the first instance, without any other data and using our observation that the Thames flushes itself every 3 days, let us try

\displaystyle   X_t = \theta_1 X_{t-1} + \theta_2 X_{t-2} + \theta_3 X_{t-3} + \epsilon_t

where X_t is the flux on day t and \{\epsilon_t\}_{t \in \mathbb{N}} are independent normal errors with mean 0 and variance some given value \sigma^2.

Kalman

As it stands, our model is not Markov so we cannot directly apply techniques such as Kalman filtering or particle filtering to estimate the parameters. However we can re-write the model as

\displaystyle   \begin{bmatrix}  \theta_1^{(t)} \\  \theta_2^{(t)} \\  \theta_3^{(t)}  \end{bmatrix} =  \begin{bmatrix}  1 & 0 & 0 \\  0 & 1 & 0 \\  0 & 0 & 1  \end{bmatrix}  \begin{bmatrix}  \theta_1^{(t-1)} \\  \theta_2^{(t-1)} \\  \theta_3^{(t-1)}  \end{bmatrix} +  \begin{bmatrix}  \eta_{t} \\  \eta_{t} \\  \eta_{t}  \end{bmatrix}

\displaystyle   y_t = \begin{bmatrix}        x_{t-1} & x_{t-2} & x_{t-3}        \end{bmatrix}  \begin{bmatrix}  \theta_1^{(t)} \\  \theta_2^{(t)} \\  \theta_3^{(t)}  \end{bmatrix} +  \epsilon_{t}

Note that the observation map now varies over time so we have to modify our Kalman filter implementation to accept a different matrix at each step.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE TypeFamilies                 #-}
> module Autoregression (
>     predictions
>   ) where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import qualified Data.Vector as V
> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $ linSolve m eye
> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n -> [L m n] -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer muPrior sigmaPrior bigHs bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) (zip ys bigHs)
> 
>     update :: (R n, Sq n) -> (R m, L m n) -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) (y, bigH) =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         v :: R m
>         v = y - bigH #> xHatFlat
>         bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>         bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>         xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

We can now set up the parameters to run the filter.

> stateVariance :: Double
> stateVariance = 1e-8
> bigSigmaX :: Sq 3
> bigSigmaX = fromList [ stateVariance, 0.0,           0.0
>                      , 0.0,           stateVariance, 0.0
>                      , 0.0,           0.0,           stateVariance
>                      ]
> bigA :: Sq 3
> bigA = eye
> muPrior :: R 3
> muPrior = fromList [0.0, 0.0, 0.0]
> sigmaPrior :: Sq 3
> sigmaPrior = fromList [ 1e1, 0.0, 0.0
>                       , 0.0, 1e1, 0.0
>                       , 0.0, 0.0, 1e1
>                       ]
> bigHsBuilder :: V.Vector Double -> [L 1 3]
> bigHsBuilder flows =
>   V.toList $
>   V.zipWith3 (\x0 x1 x2 -> fromList [x0, x1, x2])
>   (V.tail flows) (V.tail $ V.tail flows) (V.tail $ V.tail $ V.tail flows)
> obsVariance :: Double
> obsVariance = 1.0e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [ obsVariance ]
> predict :: R 3 -> Double -> Double -> Double -> Double
> predict theta f1 f2 f3 = h1 * f1 + h2 * f2 + h3 * f3
>   where
>     (h1, t1) = headTail theta
>     (h2, t2) = headTail t1
>     (h3, _)  = headTail t2
> thetas :: V.Vector Double -> [(R 3, Sq 3)]
> thetas flows = outer muPrior sigmaPrior (bigHsBuilder flows)
>                bigSigmaY bigA bigSigmaX (map (fromList . return) (V.toList flows))
> predictions :: V.Vector Double -> V.Vector Double
> predictions flows =
>   V.zipWith4 predict
>   (V.fromList $ map fst (thetas flows))
>   flows (V.tail flows) (V.tail $ V.tail flows)

How Good is Our Model?

If we assume that the parameters are essentially fixed, by taking the state variance to be e.g. 10^{-8}, then the fit is not good.

However, if we assume the parameters to undergo Brownian motion by taking the state variance to be e.g. 10^{-2} then we get a much better fit. Of course, Brownian motion is probably not a good way of modelling the parameters; we hardly expect that these could wander off to infinity.

Fun with (Extended Kalman) Filters

Summary

An extended Kalman filter in Haskell using type level literals and automatic differentiation to provide some guarantees of correctness.

Population Growth

Suppose we wish to model population growth of bees via the logistic equation

\displaystyle  \begin{aligned}  \dot{p} & = rp\Big(1 - \frac{p}{k}\Big)  \end{aligned}

We assume the growth rate r is unknown and drawn from a normal distribution {\cal{N}}(\mu_r, \sigma_r^2) but the carrying capacity k is known and we wish to estimate the growth rate by observing noisy values y_i of the population at discrete times t_0 = 0, t_1 = \Delta T, t_2 = 2\Delta T, \ldots. Note that p_t is entirely deterministic and its stochasticity is only as a result of the fact that the unknown parameter of the logistic equation is sampled from a normal distribution (we could for example be observing different colonies of bees and we know from the literature that bee populations obey the logistic equation and each colony will have different growth rates).

Haskell Preamble

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> {-# LANGUAGE BangPatterns                 #-}
> {-# LANGUAGE TypeOperators                #-}
> {-# LANGUAGE TypeFamilies                 #-}
> module FunWithKalman3 where
> import GHC.TypeLits
> import Numeric.LinearAlgebra.Static
> import Data.Maybe ( fromJust )
> import Numeric.AD
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import Control.Monad.Loops

Logistic Equation

The logistic equation is a well known example of a dynamical system which has an analytic solution

\displaystyle  p = \frac{kp_0\exp rt}{k + p_0(\exp rt - 1)}

Here it is in Haskell

> logit :: Floating a => a -> a -> a -> a
> logit p0 k x = k * p0 * (exp x) / (k + p0 * (exp x - 1))

We observe a noisy value of population at regular time intervals (where \Delta T is the time interval)

\displaystyle  \begin{aligned}  p_i &= \frac{kp_0\exp r\Delta T i}{k + p_0(\exp r\Delta T i - 1)} \\  y_i &= p_i + \epsilon_i  \end{aligned}

Using the semi-group property of our dynamical system, we can re-write this as

\displaystyle  \begin{aligned}  p_i &= \frac{kp_{i-1}\exp r\Delta T}{k + p_{i-1}(\exp r\Delta T - 1)} \\  y_i &= p_i + \epsilon_i  \end{aligned}

To convince yourself that this re-formulation is correct, think of the population as starting at p_0; after 1 time step it has reached p_1 and after two time steps it has reached p_2, and this ought to be the same as the point reached after 1 time step starting at p_1, for example

> oneStepFrom0, twoStepsFrom0, oneStepFrom1 :: Double
> oneStepFrom0  = logit 0.1 1.0 (1 * 0.1)
> twoStepsFrom0 = logit 0.1 1.0 (1 * 0.2)
> oneStepFrom1  = logit oneStepFrom0 1.0 (1 * 0.1)
ghci> twoStepsFrom0
  0.11949463171139338

ghci> oneStepFrom1
  0.1194946317113934

We would like to infer the growth rate, not just be able to predict the population, so we need to add another variable to our model.

\displaystyle  \begin{aligned}  r_i &= r_{i-1} \\  p_i &= \frac{kp_{i-1}\exp r_{i-1}\Delta T}{k + p_{i-1}(\exp r_{i-1}\Delta T - 1)} \\  y_i &= \begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}r_i \\ p_i\end{bmatrix} + \begin{bmatrix}0 \\ \epsilon_i\end{bmatrix}  \end{aligned}

Extended Kalman

This is almost in the form suitable for estimation using a Kalman filter but the dependency of the state on the previous state is non-linear. We can modify the Kalman filter to create the extended Kalman filter (EKF) by making a linear approximation.

Since the measurement update is trivially linear (even in this more general form), the measurement update step remains unchanged.

\displaystyle  \begin{aligned}  \boldsymbol{v}_i & \triangleq  \boldsymbol{y}_i - \boldsymbol{g}(\hat{\boldsymbol{x}}^\flat_i) \\  \boldsymbol{S}_i & \triangleq  \boldsymbol{G}_i \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{G}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{G}_i^\top\boldsymbol{S}^{-1}_i \\  \hat{\boldsymbol{x}}_i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i  \end{aligned}

By Taylor's theorem we have

\displaystyle  \boldsymbol{a}(\boldsymbol{x}) \approx \boldsymbol{a}(\boldsymbol{m}) + \boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m})\delta\boldsymbol{x}

where \boldsymbol{A}_{\boldsymbol{x}}(\boldsymbol{m}) is the Jacobian of \boldsymbol{a} evaluated at \boldsymbol{m} (for the exposition of the extended filter we take \boldsymbol{a} to be vector valued, hence the bold font). We take \delta\boldsymbol{x} to be normally distributed with mean 0 and ignore any difficulties there may be with applying Taylor's theorem to stochastic variables.

Applying this at \boldsymbol{m} = \hat{\boldsymbol{x}}_{i-1} we have

\displaystyle  \boldsymbol{x}_i = \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1})(\boldsymbol{x}_{i-1} - \hat{\boldsymbol{x}}_{i-1}) + \boldsymbol{\epsilon}_i

Using the same reasoning as we did for the Kalman filter and writing \boldsymbol{A}_{i-1} for \boldsymbol{A}_{\boldsymbol{x}}(\hat{\boldsymbol{x}}_{i-1}) we obtain

\displaystyle  \begin{aligned}  \hat{\boldsymbol{x}}^\flat_i &=  \boldsymbol{a}(\hat{\boldsymbol{x}}_{i-1}) \\  \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1}  \hat{\boldsymbol{\Sigma}}_{i-1}  \boldsymbol{A}_{i-1}^\top  + \boldsymbol{\Sigma}^{(x)}_{i-1}  \end{aligned}

Haskell Implementation

Note that we pass in the Jacobian of the update function as a function itself in the case of the extended Kalman filter rather than the matrix representing the linear function as we do in the case of the classical Kalman filter.

> k, p0 :: Floating a => a
> k = 1.0
> p0 = 0.1 * k
> r, deltaT :: Floating a => a
> r = 10.0
> deltaT = 0.0005

Relating ad and hmatrix is somewhat unpleasant but this can probably be ameliorated by defining a suitable datatype.

> a :: R 2 -> R 2
> a rpPrev = rNew # pNew
>   where
>     (r, pPrev) = headTail rpPrev
>     rNew :: R 1
>     rNew = konst r
> 
>     (p,  _) = headTail pPrev
>     pNew :: R 1
>     pNew = fromList $ [logit p k (r * deltaT)]
> bigA :: R 2 -> Sq 2
> bigA rp = fromList $ concat $ j [r, p]
>   where
>     (r, ps) = headTail rp
>     (p,  _) = headTail ps
>     j = jacobian (\[r, p] -> [r, logit p k (r * deltaT)])

For some reason, hmatrix with static guarantees does not yet provide an inverse function for matrices.

> inv :: (KnownNat n, (1 <=? n) ~ 'True) => Sq n -> Sq n
> inv m = fromJust $ linSolve m eye

Here is the extended Kalman filter itself. The type signatures on the expressions inside the function are not necessary but did help the implementor discover a bug in the mathematical derivation and will hopefully help the reader.

> outer ::  forall m n . (KnownNat m, KnownNat n,
>                         (1 <=? n) ~ 'True, (1 <=? m) ~ 'True) =>
>           R n -> Sq n ->
>           L m n -> Sq m ->
>           (R n -> R n) -> (R n -> Sq n) -> Sq n ->
>           [R m] ->
>           [(R n, Sq n)]
> outer muPrior sigmaPrior bigH bigSigmaY
>       littleA bigABuilder bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
> 
>     update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
> 
>         v :: R m
>         v = y - (bigH #> xHatFlat)
> 
>         bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
> 
>         bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
> 
>         xHat :: R n
>         xHat = xHatFlat + bigK #> v
> 
>         bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
> 
>         bigA :: Sq n
>         bigA = bigABuilder xHat
> 
>         xHatFlatNew :: R n
>         xHatFlatNew = littleA xHat
> 
>         bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

Now let us create some sample data.

> obsVariance :: Double
> obsVariance = 1e-2
> bigSigmaY :: Sq 1
> bigSigmaY = fromList [obsVariance]
> nObs :: Int
> nObs = 300
> singleSample :: Double -> RVarT (W.Writer [Double]) Double
> singleSample p0 = do
>   -- Normal takes a standard deviation, so take the square root of the variance
>   epsilon <- rvarT (Normal 0.0 (sqrt obsVariance))
>   let p1 = logit p0 k (r * deltaT)
>   lift $ W.tell [p1 + epsilon]
>   return p1
> streamSample :: RVarT (W.Writer [Double]) Double
> streamSample = iterateM_ singleSample p0
> samples :: [Double]
> samples = take nObs $ snd $
>           W.runWriter (evalStateT (sample streamSample) (pureMT 3))

We created our data with a growth rate of

ghci> r
  10.0

but let us pretend that we have read the literature on growth rates of bee colonies: we have serious doubts about the growth rate but are almost certain about the size of the colony at t = 0.

> muPrior :: R 2
> muPrior = fromList [5.0, 0.1]
> 
> sigmaPrior :: Sq 2
> sigmaPrior = fromList [ 1e2, 0.0
>                       , 0.0, 1e-10
>                       ]

We only observe the population and not the rate itself.

> bigH :: L 1 2
> bigH = fromList [0.0, 1.0]

Strictly speaking the state noise should be 0 since the state evolution is deterministic, but a small value is close enough and keeps the filter numerically well-behaved.

> bigSigmaX :: Sq 2
> bigSigmaX = fromList [ 1e-10, 0.0
>                      , 0.0, 1e-10
>                      ]

Now we can run our filter and watch it switch away from our prior belief as it accumulates more and more evidence.

> test :: [(R 2, Sq 2)]
> test = outer muPrior sigmaPrior bigH bigSigmaY
>        a bigA bigSigmaX (map (fromList . return) samples)
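For example, the estimate of the growth rate is the first component of the final filtered mean (rEstimate is a name we introduce here purely as a usage sketch):

> rEstimate :: Double
> rEstimate = fst $ headTail $ fst $ last test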

Backpropagation is Just Steepest Descent with Automatic Differentiation

Preface

The intended audience of this article is someone who knows something about Machine Learning and Artificial Neural Networks (ANNs) in particular and who recalls that fitting an ANN required a technique called backpropagation. The goal of this post is to refresh the reader’s knowledge of ANNs and backpropagation and to show that the latter is merely a specialised version of automatic differentiation, a tool that all Machine Learning practitioners should know about and have in their toolkit.

Introduction

The problem is simple to state: we have a (highly) non-linear function, the cost function of an Artificial Neural Network (ANN), and we wish to minimize this so as to estimate the parameters / weights of the function.

In order to minimise the function, one obvious approach is to use steepest descent: start with random values for the parameters to be estimated, find the direction in which the function decreases most quickly, step a small amount in that direction and repeat until close enough.

But we have two problems:

  • We have an algorithm or a computer program that calculates the non-linear function rather than the function itself.

  • The function has a very large number of parameters, hundreds if not thousands.

One thing we could try is bumping each parameter by a small amount to get partial derivatives numerically

\displaystyle   \frac{\partial E(\ldots, w, \ldots)}{\partial w} \approx \frac{E(\ldots, w + \epsilon, \ldots) - E(\ldots, w, \ldots)}{\epsilon}

But this would mean evaluating our function many times and moreover we could easily get numerical errors as a result of the vagaries of floating point arithmetic.

As an alternative we could turn our algorithm or computer program into a function more recognisable as a mathematical function and then compute the differential itself as a function either by hand or by using a symbolic differentiation package. For the complicated expression which is our mathematical function, the former would be error prone and the latter could easily generate something which would be even more complex and costly to evaluate than the original expression.

The standard approach is to use a technique called backpropagation and the understanding and application of this technique forms a large part of many machine learning lecture courses.

Since at least the 1960s techniques for automatically differentiating computer programs have been discovered and re-discovered. Anyone who knows about these techniques and reads about backpropagation quickly realises that backpropagation is just automatic differentiation and steepest descent.

This article is divided into

  • Refresher on neural networks and backpropagation;

  • Methods for differentiation;

  • Backward and forward automatic differentiation and

  • Concluding thoughts.

The only thing important to remember throughout is the chain rule

\displaystyle   (g \circ f)'(a) = g'(f(a))\cdot f'(a)

in alternative notation

\displaystyle   \frac{\mathrm{d} (g \circ f)}{\mathrm{d} x}(a) =  \frac{\mathrm{d} g}{\mathrm{d} y}(f(a)) \frac{\mathrm{d} f}{\mathrm{d} x}(a)

where y = f(x). More suggestively we can write

\displaystyle   \frac{\mathrm{d} g}{\mathrm{d} x} =  \frac{\mathrm{d} g}{\mathrm{d} y} \frac{\mathrm{d} y}{\mathrm{d} x}

where it is understood that \mathrm{d} g / \mathrm{d} x and \mathrm{d} y / \mathrm{d} x are evaluated at a and \mathrm{d} g / \mathrm{d} y is evaluated at f(a).

For example,

\displaystyle   \frac{\mathrm{d}}{\mathrm{d} x} \sqrt{3 \sin(x)} =  \frac{\mathrm{d}}{\mathrm{d} x} (3 \sin(x)) \cdot \frac{\mathrm{d}}{\mathrm{d} y} \sqrt{y} =  3 \cos(x) \cdot \frac{1}{2\sqrt{y}} =  \frac{3\cos(x)}{2\sqrt{3\sin(x)}}

Acknowledgements

Sadly I cannot recall all the sources I looked at in order to produce this article but I have made heavy use of the following.

Neural Network Refresher

Here is our model, with \boldsymbol{x} the input, \hat{\boldsymbol{y}} the predicted output and \boldsymbol{y} the actual output and w^{(k)} the weights in the k-th layer. We have concretised the transfer function as \tanh but it is quite popular to use the \text{logit} function.

\displaystyle   \begin{aligned}  a_i^{(1)} &= \sum_{j=0}^{N^{(1)}} w_{ij}^{(1)} x_j \\  z_i^{(1)} &= \tanh(a_i^{(1)}) \\  a_i^{(2)} &= \sum_{j=0}^{N^{(2)}} w_{ij}^{(2)} z_j^{(1)} \\  \dots     &= \ldots \\  a_i^{(L-1)} &= \sum_{j=0}^{N^{(L-1)}} w_{ij}^{(L-1)} z_j^{(L-2)} \\  z_j^{(L-1)} &= \tanh(a_j^{(L-1)}) \\  \hat{y}_i &= \sum_{j=0}^{N^{(L)}} w_{ij}^{(L)} z_j^{(L-1)} \\  \end{aligned}

with the loss or cost function

\displaystyle   E(\boldsymbol{w}; \boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|(\hat{\boldsymbol{y}} - \boldsymbol{y})\|^2

The diagram below depicts a neural network with a single hidden layer.

In order to apply the steepest descent algorithm we need to calculate the differentials of this latter function with respect to the weights, that is, we need to calculate

\displaystyle   \Delta w_{ij} = \frac{\partial E}{\partial w_{ij}}

Applying the chain rule

\displaystyle   \Delta w_{ij} =  \frac{\partial E}{\partial w_{ij}} =  \frac{\partial E}{\partial a_i}\frac{\partial a_i}{\partial w_{ij}}

Since

\displaystyle   a_j^{(l)} = \sum_{i=0}^N w_{ij}^{(l)}z_i^{(l-1)}

we have

\displaystyle   \frac{\partial a_j^{(l)}}{\partial w_{ij}^{(l)}} =  \frac{\partial}{\partial w_{ij}^{(l)}}\sum_{k=0}^M w_{kj}^{(l)}z_k^{(l-1)} =  z_i^{(l-1)}

Defining

\displaystyle   \delta_j^{(l)} \equiv  \frac{\partial E}{\partial a_j^{(l)}}

we obtain

\displaystyle   \Delta w_{ij}^{(l)} =  \frac{\partial E}{\partial w_{ij}^{(l)}} =  \delta_j^{(l)} z_i^{(l-1)}

Finding the z_i for each layer is straightforward: we start with the inputs and propagate forward. In order to find the \delta_j we need to start with the outputs and propagate backwards:

For the output layer we have (since \hat{y}_j = a_j)

\displaystyle   \delta_j = \frac{\partial E}{\partial a_j} = \frac{\partial E}{\partial \hat{y}_j} = \frac{\partial}{\partial \hat{y}_j}\bigg(\frac{1}{2}\sum_{i=0}^M (\hat{y}_i - y_i)^2\bigg) = \hat{y}_j - y_j

For a hidden layer using the chain rule

\displaystyle   \delta_j^{(l-1)} = \frac{\partial E}{\partial a_j^{(l-1)}} =  \sum_k \frac{\partial E}{\partial a_k^{(l)}}\frac{\partial a_k^{(l)}}{\partial a_j^{(l-1)}}

Now

\displaystyle   a_k^{(l)} = \sum_i w_{ki}^{(l)}z_i^{(l-1)} = \sum_i w_{ki}^{(l)} f(a_i^{(l-1)})

so that

\displaystyle   \frac{\partial a_k^{(l)}}{\partial a_j^{(l-1)}} =  \frac{\partial}{\partial a_j^{(l-1)}}\sum_i w_{ki}^{(l)} f(a_i^{(l-1)}) =  w_{kj}^{(l)}\,f'(a_j^{(l-1)})

and thus

\displaystyle   \delta_j^{(l-1)} =  \sum_k \frac{\partial E}{\partial a_k^{(l)}}\frac{\partial a_k^{(l)}}{\partial a_j^{(l-1)}} =  \sum_k \delta_k^{(l)} w_{kj}^{(l)}\, f'(a_j^{(l-1)}) =  f'(a_j^{(l-1)}) \sum_k \delta_k^{(l)} w_{kj}^{(l)}

Summarising

  1. We calculate all a_j and z_j for each layer starting with the input layer and propagating forward.

  2. We evaluate \delta_j^{(L)} in the output layer using \delta_j = \hat{y}_j - y_j.

  3. We evaluate \delta_j in each layer using \delta_j^{(l-1)} = f'(a_j^{(l-1)})\sum_k \delta_k^{(l)} w_{kj}^{(l)} starting with the output layer and propagating backwards.

  4. Use \partial E / \partial w_{ij}^{(l)} = \delta_j^{(l)} z_i^{(l-1)} to obtain the required derivatives in each layer.

For the particular activation function \tanh we have f'(a) = \tanh' (a) = 1 - \tanh^2(a). And finally we can use the partial derivatives to step in the right direction using steepest descent

\displaystyle   w' = w - \gamma\nabla E(w)

where \gamma is the step size aka the learning rate.
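To make these four steps concrete, here is a minimal standalone sketch (deliberately kept out of the literate code in this post; the helper mv, the function backprop and the weight layout are our own choices) for a network with a single hidden layer, \tanh activations, a linear output layer and no bias terms:

import Data.List (transpose)

type Vec = [Double]
type Mat = [[Double]] -- one row of weights per neuron

-- matrix-vector multiplication
mv :: Mat -> Vec -> Vec
mv m v = map (sum . zipWith (*) v) m

backprop :: Mat -> Mat -> Vec -> Vec -> (Mat, Mat)
backprop w1 w2 x y = (grad1, grad2)
  where
    -- step 1: propagate forward, calculating the a_j and z_j
    a1    = mv w1 x
    z1    = map tanh a1
    yHat  = mv w2 z1                  -- linear output layer
    -- step 2: deltas in the output layer
    d2    = zipWith (-) yHat y
    -- step 3: deltas in the hidden layer, using tanh' a = 1 - tanh^2 a
    d1    = zipWith (\a s -> (1 - tanh a ^ 2) * s) a1
                    (map (sum . zipWith (*) d2) (transpose w2))
    -- step 4: the gradients are the products delta_j z_i
    grad1 = [ [ d * xi | xi <- x  ] | d <- d1 ]
    grad2 = [ [ d * zi | zi <- z1 ] | d <- d2 ]

A steepest descent step then subtracts \gamma times each gradient from the corresponding weight matrix.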

Differentiation

So now we have an efficient algorithm for differentiating the cost function for an ANN and thus estimating its parameters but it seems quite complex. In the introduction we alluded to other methods of differentiation. Let us examine those in a bit more detail before moving on to a general technique for differentiating programs of which backpropagation turns out to be a specialisation.

Numerical Differentiation

Consider the function f(x) = e^x; its derivative is f'(x) = e^x and we can easily compare a numerical approximation of this with the exact result. The numerical approximation is given by

\displaystyle   f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}

In theory we should get a closer and closer approximation as \epsilon decreases, but as the chart below shows, at some point (around \epsilon \approx 2^{-26}) the approximation worsens because we are using floating point arithmetic. For a complex function such as one which calculates the cost function of an ANN, there is a risk that we may end up getting a poor approximation for the derivative and thus a poor estimate for the parameters of the model.
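The effect is easy to reproduce. Here is a small standalone sketch (not part of the literate code in this post) which tabulates the error of the one-sided difference for f = \exp at x = 1:

import Text.Printf (printf)

-- print the absolute error of the finite difference approximation
-- to exp' 1 for a range of values of epsilon
main :: IO ()
main = mapM_ errAt [4, 8 .. 48]
  where
    errAt :: Int -> IO ()
    errAt n = printf "eps = 2^-%2d: error = %10.3e\n" n err
      where
        eps    = 2 ^^ negate n :: Double
        approx = (exp (1 + eps) - exp 1) / eps
        err    = abs (approx - exp 1)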

Symbolic Differentiation

Suppose we have the following program (written in Python)

import numpy as np

def many_sines(x):
    y = x
    for i in range(1,7):
        y = np.sin(x+y)
    return y

When we unroll the loop we are actually evaluating

\displaystyle   f(x) = \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + x))))))

Now suppose we want to get the differential of this function. Symbolically this would be

\displaystyle   \begin{aligned}  f'(x) &=           (((((2\cdot \cos(2x)+1)\cdot \\        &\phantom{=} \cos(\sin(2x)+x)+1)\cdot \\        &\phantom{=} \cos(\sin(\sin(2x)+x)+x)+1)\cdot \\        &\phantom{=} \cos(\sin(\sin(\sin(2x)+x)+x)+x)+1)\cdot \\        &\phantom{=} \cos(\sin(\sin(\sin(\sin(2x)+x)+x)+x)+x)+1)\cdot \\        &\phantom{=} \cos(\sin(\sin(\sin(\sin(\sin(2x)+x)+x)+x)+x)+x)  \end{aligned}

Typically the non-linear function that an ANN gives is much more complex than the simple function given above. Thus its derivative will be correspondingly more complex and therefore expensive to compute. Moreover calculating this derivative by hand could easily introduce errors. And in order to have a computer perform the symbolic calculation we would have to encode our cost function somehow so that it is amenable to this form of manipulation.

Automatic Differentiation

Reverse Mode

Traditionally, forward mode is introduced first as this is considered easier to understand. We introduce reverse mode first as it can be seen to be a generalization of backpropagation.

Consider the function

\displaystyle   f(x) = \exp(\exp(x) + (\exp(x))^2) + \sin(\exp(x) + (\exp(x))^2)

Let us write this as a data flow graph.

We can thus re-write our function as a sequence of simpler functions in which each function only depends on variables earlier in the sequence.

\displaystyle   \begin{aligned}  u_7    &= f_7(u_6, u_5, u_4, u_3, u_2, u_1) \\  u_6    &= f_6(u_5, u_4, u_3, u_2, u_1) \\  \ldots &= \ldots \\  u_1    &= f_1(x)  \end{aligned}

\displaystyle   \begin{aligned}  \mathrm{d}u_7    &= \frac{\partial f_7}{\partial u_6} \mathrm{d} u_6 +                      \frac{\partial f_7}{\partial u_5} \mathrm{d} u_5 +                      \frac{\partial f_7}{\partial u_4} \mathrm{d} u_4 +                      \frac{\partial f_7}{\partial u_3} \mathrm{d} u_3 +                      \frac{\partial f_7}{\partial u_2} \mathrm{d} u_2 +                      \frac{\partial f_7}{\partial u_1} \mathrm{d} u_1 \\  \mathrm{d}u_6    &= \frac{\partial f_6}{\partial u_5} \mathrm{d} u_5 +                      \frac{\partial f_6}{\partial u_4} \mathrm{d} u_4 +                      \frac{\partial f_6}{\partial u_3} \mathrm{d} u_3 +                      \frac{\partial f_6}{\partial u_2} \mathrm{d} u_2 +                      \frac{\partial f_6}{\partial u_1} \mathrm{d} u_1 \\  \ldots           &= \ldots \\  \mathrm{d}u_1    &= \frac{\partial f_1}{\partial u_1} \mathrm{d} u_1  \end{aligned}

In our particular example, since u_1, \dots, u_5 do not depend on u_6

\displaystyle   \begin{aligned}  \frac{\mathrm{d}u_7}{\mathrm{d}u_6} &= 1  \end{aligned}

Further u_6 does not depend on u_5 so we also have

\displaystyle   \begin{aligned}  \frac{\mathrm{d}u_7}{\mathrm{d}u_5} &= 1 \\  \end{aligned}

Now things become more interesting as u_6 and u_5 both depend on u_4 and so the chain rule makes an explicit appearance

\displaystyle   \begin{aligned}  \frac{\mathrm{d}u_7}{\mathrm{d}u_4} &=   \frac{\mathrm{d}u_7}{\mathrm{d}u_6}\frac{\mathrm{d}u_6}{\mathrm{d}u_4} +   \frac{\mathrm{d}u_7}{\mathrm{d}u_5}\frac{\mathrm{d}u_5}{\mathrm{d}u_4} \\  &= \frac{\mathrm{d}u_7}{\mathrm{d}u_6}\exp{u_4} +     \frac{\mathrm{d}u_7}{\mathrm{d}u_5}\cos{u_4}  \end{aligned}

Carrying on

\displaystyle   \begin{aligned}  \frac{\mathrm{d}u_7}{\mathrm{d}u_3} &=   \frac{\mathrm{d}u_7}{\mathrm{d}u_4}\frac{\mathrm{d}u_4}{\mathrm{d}u_3} \\  &= \frac{\mathrm{d}u_7}{\mathrm{d}u_4} \\  \frac{\mathrm{d}u_7}{\mathrm{d}u_2} &=   \frac{\mathrm{d}u_7}{\mathrm{d}u_4}\frac{\mathrm{d}u_4}{\mathrm{d}u_2} +   \frac{\mathrm{d}u_7}{\mathrm{d}u_3}\frac{\mathrm{d}u_3}{\mathrm{d}u_2} \\  &= \frac{\mathrm{d}u_7}{\mathrm{d}u_4} + 2u_2\frac{\mathrm{d}u_7}{\mathrm{d}u_4} \\  \frac{\mathrm{d}u_7}{\mathrm{d}u_1} &=   \frac{\mathrm{d}u_7}{\mathrm{d}u_2}\frac{\mathrm{d}u_2}{\mathrm{d}u_1} \\  &=\frac{\mathrm{d}u_7}{\mathrm{d}u_2}\exp{u_1}  \end{aligned}

Note that having worked from top to bottom (the forward sweep) in the graph to calculate the function itself, we have to work backwards from bottom to top (the backward sweep) to calculate the derivative.

So provided we can translate our program into a call graph, we can apply this procedure to calculate the differential with the same complexity as the original program.
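As a concrete illustration, here is a standalone sketch (again not part of the literate code) of the two sweeps for our example function, with the variables numbered as in the derivatives above:

-- f(x) = exp(exp x + (exp x)^2) + sin(exp x + (exp x)^2)
fAndGrad :: Double -> (Double, Double)
fAndGrad x = (u7, du1)
  where
    -- forward sweep: evaluate the function
    u1 = x
    u2 = exp u1
    u3 = u2 * u2
    u4 = u2 + u3
    u5 = sin u4
    u6 = exp u4
    u7 = u5 + u6
    -- backward sweep: accumulate du_7/du_i from bottom to top
    du6 = 1
    du5 = 1
    du4 = du6 * exp u4 + du5 * cos u4
    du3 = du4
    du2 = du4 + 2 * u2 * du3
    du1 = du2 * exp u1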

The pictorial representation of an ANN is effectively the data flow graph of the cost function (without the final cost calculation itself), and calculating its differential via this procedure is exactly backpropagation.

Forward Mode

An alternative method for automatic differentiation is called forward mode and has a simple implementation. Let us illustrate this using Haskell 98. The actual implementation is about 20 lines of code.

First some boilerplate declarations that need not concern us further.

> {-# LANGUAGE NoMonomorphismRestriction #-}
> 
> module AD (
>     Dual(..)
>   , f
>   , idD
>   , cost
>   , zs
>   ) where
> 
> default ()

Let us define dual numbers

> data Dual = Dual Double Double
>   deriving (Eq, Show)

We can think of these pairs as first order polynomials in the indeterminate \epsilon, x + \epsilon x', such that \epsilon^2 = 0.

Thus, for example, we have

\displaystyle   \begin{aligned}  (x + \epsilon x') + (y + \epsilon y') &= ((x + y) + \epsilon (x' + y')) \\  (x + \epsilon x')(y + \epsilon y') &= xy + \epsilon (xy' + x'y) \\  \log (x + \epsilon x') &=  \log x (1 + \epsilon \frac {x'}{x}) =  \log x + \epsilon\frac{x'}{x} \\  \sqrt{(x + \epsilon x')} &=  \sqrt{x(1 + \epsilon\frac{x'}{x})} =  \sqrt{x}(1 + \epsilon\frac{1}{2}\frac{x'}{x}) =  \sqrt{x} + \epsilon\frac{1}{2}\frac{x'}{\sqrt{x}} \\  \ldots &= \ldots  \end{aligned}

Notice that these equations implicitly encode the chain rule. For example, we know, using the chain rule, that

\displaystyle   \frac{\mathrm{d}}{\mathrm{d} x}\log(\sqrt x) =  \frac{1}{\sqrt x}\frac{1}{2}x^{-1/2} =  \frac{1}{2x}

And using the example equations above we have

\displaystyle   \begin{aligned}  \log(\sqrt {x + \epsilon x'}) &= \log (\sqrt{x} + \epsilon\frac{1}{2}\frac{x'}{\sqrt{x}}) \\                                &= \log (\sqrt{x}) + \epsilon\frac{\frac{1}{2}\frac{x'}{\sqrt{x}}}{\sqrt{x}} \\                                &= \log (\sqrt{x}) + \epsilon x'\frac{1}{2x}  \end{aligned}

Notice that dual numbers carry around the calculation and the derivative of the calculation. To actually evaluate \log(\sqrt{x}) at a particular value, say 2, we plug in 2 for x and 1 for x'

\displaystyle   \log \big(\sqrt{2 + \epsilon \cdot 1}\big) = \log(\sqrt{2}) + \epsilon\frac{1}{4}

Thus the derivative of \log(\sqrt{x}) at 2 is 1/4.

With a couple of helper functions we can implement this rule (\epsilon^2 = 0) by making Dual an instance of Num, Fractional and Floating.

> constD :: Double -> Dual
> constD x = Dual x 0
> 
> idD :: Double -> Dual
> idD x = Dual x 1.0

Let us implement the rules above by declaring Dual to be an instance of Num. A Haskell class such as Num simply states that it is possible to perform a (usually) related collection of operations on any type which is declared as an instance of that class. For example, Integer and Double are both types which are instances of Num and thus one can add, multiply, etc. values of these types (but note one cannot add an Integer to a Double without first converting a value of the former to a value of the latter).

As an aside, we will never need the functions signum and abs and declare them as undefined; in a robust implementation we would specify an error if they were ever accidentally used.

> instance Num Dual where
>   fromInteger n             = constD $ fromInteger n
>   (Dual x x') + (Dual y y') = Dual (x + y) (x' + y')
>   (Dual x x') * (Dual y y') = Dual (x * y) (x * y' + y * x')
>   negate (Dual x x')        = Dual (negate x) (negate x')
>   signum _                  = undefined
>   abs _                     = undefined
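For example, squaring 2 + \epsilon gives 4 + 4\epsilon, the \epsilon^2 term vanishing as required:

ghci> Dual 2 1 * Dual 2 1
  Dual 4.0 4.0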

We need to be able to perform division on Dual so we further declare it to be an instance of Fractional.

> instance Fractional Dual where
>   fromRational p = constD $ fromRational p
>   recip (Dual x x') = Dual (1.0 / x) (-x' / (x * x))

We want to be able to perform the same operations on Dual as we can on Float and Double. Thus we make Dual an instance of Floating which means we can now operate on values of this type as though, in some sense, they are the same as values of Float or Double (in Haskell 98 only instances for Float and Double are defined for the class Floating).

> instance Floating Dual where
>   pi = constD pi
>   exp   (Dual x x') = Dual (exp x)   (x' * exp x)
>   log   (Dual x x') = Dual (log x)   (x' / x)
>   sqrt  (Dual x x') = Dual (sqrt x)  (x' / (2 * sqrt x))
>   sin   (Dual x x') = Dual (sin x)   (x' * cos x)
>   cos   (Dual x x') = Dual (cos x)   (x' * (- sin x))
>   sinh  (Dual x x') = Dual (sinh x)  (x' * cosh x)
>   cosh  (Dual x x') = Dual (cosh x)  (x' * sinh x)
>   asin  (Dual x x') = Dual (asin x)  (x' / sqrt (1 - x*x))
>   acos  (Dual x x') = Dual (acos x)  (x' / (-sqrt (1 - x*x)))
>   atan  (Dual x x') = Dual (atan x)  (x' / (1 + x*x))
>   asinh (Dual x x') = Dual (asinh x) (x' / sqrt (1 + x*x))
>   acosh (Dual x x') = Dual (acosh x) (x' / (sqrt (x*x - 1)))
>   atanh (Dual x x') = Dual (atanh x) (x' / (1 - x*x))

That’s all we need to do. Let us implement the function we considered earlier.

> f =  sqrt . (* 3) . sin

The compiler can infer its type

ghci> :t f
  f :: Floating c => c -> c

We know the derivative of the function and can also implement it directly in Haskell.

> f' x = 3 * cos x / (2 * sqrt (3 * sin x))

Now we can evaluate the function along with its automatically calculated derivative and compare that with the derivative we calculated symbolically by hand.

ghci> f $ idD 2
  Dual 1.6516332160855343 (-0.3779412091869595)

ghci> f' 2
  -0.3779412091869595

To see that we are not doing symbolic differentiation (it’s easy to see we are not doing numerical differentiation) let us step through the actual evaluation.

\displaystyle   \begin{aligned}  f\,\$\,\mathrm{idD}\,2 &\longrightarrow \mathrm{sqrt} \cdot \lambda x \rightarrow 3x \cdot \sin \$\,\mathrm{idD}\,2 \\  &\longrightarrow \mathrm{sqrt} \cdot \lambda x \rightarrow 3x \cdot \sin \$\,\mathrm{Dual}\,2\,1 \\  &\longrightarrow \mathrm{sqrt} \cdot \lambda x \rightarrow 3x\,\$\, \mathrm{Dual}\,\sin(2)\,\cos(2) \\  &\longrightarrow \mathrm{sqrt} \,\$\, \mathrm{Dual}\,3\,0 \times \mathrm{Dual}\,\sin(2)\,\cos(2) \\  &\longrightarrow \mathrm{sqrt} \,\$\, \mathrm{Dual}\,(3\sin(2))\, (3\cos(2) + 0\sin(2)) \\  &\longrightarrow \mathrm{sqrt} \,\$\, \mathrm{Dual}\,(3\sin(2))\, (3\cos(2)) \\  &\longrightarrow \mathrm{Dual}\,(\mathrm{sqrt} (3\sin(2)))\, \big((3\cos(2)) / (2\,\mathrm{sqrt}(3\sin(2)))\big) \\  &\longrightarrow \mathrm{Dual}\,1.6516332160855343\,(-0.3779412091869595)  \end{aligned}

A Simple Application

In order not to make this blog post too long let us apply AD to finding parameters for a simple regression. The application to ANNs is described in a previous blog post. Note that in a real application we would use the Haskell ad package and furthermore use reverse mode AD, as in this case it would be more efficient.

First our cost function

\displaystyle   L(\boldsymbol{x}, \boldsymbol{y}, m, c) = \frac{1}{2n}\sum_{i=1}^n (y_i - (mx_i + c))^2

> cost m c xs ys = (/ (2 * (fromIntegral $ length xs))) $
>                  sum $
>                  zipWith errSq xs ys
>   where
>     errSq x y = z * z
>       where
>         z = y - (m * x + c)
ghci> :t cost
  cost :: Fractional a => a -> a -> [a] -> [a] -> a

Some test data

> xs = [1,2,3,4,5,6,7,8,9,10]
> ys = [3,5,7,9,11,13,15,17,19,21]

and a learning rate

> gamma = 0.04

Now we create a function of the two parameters in our model by applying the cost function to the data. We need the (partial) derivatives of both the slope and the offset.

> g m c = cost m c xs ys

Now we can use our Dual numbers to calculate the required partial derivatives and update our estimates of the parameters. We create a stream of estimates.

> zs = (0.1, 0.1) : map f zs
>   where
> 
>     deriv (Dual _ x') = x'
> 
>     f (c, m) = (c - gamma * cDeriv, m - gamma * mDeriv)
>       where
>         cDeriv = deriv $ g (constD m) $ idD c
>         mDeriv = deriv $ flip g (constD c) $ idD m

And we can calculate the cost of each estimate to check that our algorithm converges and then take the estimated parameters when the change in cost per iteration has reached an acceptable level.

ghci> take 2 $ drop 1000 $ map (\(c, m) -> cost m c xs ys) zs
  [1.9088215184565296e-9,1.876891490619424e-9]

ghci> take 2 $ drop 1000 zs
  [(0.9998665320141327,2.0000191714150106),(0.999867653022265,2.0000190103927853)]

Concluding Thoughts

Efficiency

Perhaps AD is underused because of efficiency concerns?

It seems that the Financial Services industry is aware that AD is more efficient than current practice, although the technique is only slowly permeating. Order of magnitude improvements have been reported.

Perhaps AD is slowly permeating into Machine Learning as well, but there seem to be no easy-to-find benchmarks.

Automatic Differentiation Tools

If it were only possible to implement automatic differentiation in Haskell then its applicability would be somewhat limited. Fortunately this is not the case and it can be used in many languages.

In general, there are three different approaches:

  • Operator overloading: available for Haskell and C++. See the Haskell ad package and the C++ FADBAD approach using templates.

  • Source to source translators: available for Fortran, C and other languages, e.g. ADIFOR and TAPENADE; see the Wikipedia entry for a more comprehensive list.

  • New languages with built-in AD primitives. I have not listed any as it seems unlikely that anyone practicing Machine Learning would want to transfer their existing code to a research language. Maybe AD researchers could invest time in understanding what language feature improvements are needed to support AD natively in existing languages.

Neural Networks and Automated Differentiation

Introduction

Neural networks are a method for classifying data based on a theory of how biological systems operate. They can also be viewed as a generalization of logistic regression. A method for determining the coefficients of a given model, backpropagation, was developed in the 1970’s and rediscovered in the 1980’s.

The article “A Functional Approach to Neural Networks” in the Monad Reader shows how to use a neural network to classify handwritten digits in the MNIST database using backpropagation.

The reader is struck by how similar backpropagation is to automatic differentiation, and may not therefore be surprised to find that this observation has been made before (Domke 2009). Indeed, as Dan Piponi observes: “the grandaddy machine-learning algorithm of them all, back-propagation, is nothing but steepest descent with reverse mode automatic differentiation”.

Neural Networks

We can view neural nets, or at least a multi-layer perceptron, as a generalisation of (multivariate) linear logistic regression.

We follow (Rojas 1996; Bishop 2006). We are given a training set:

\displaystyle  \{(\boldsymbol{x}_0, \boldsymbol{y}_0), (\boldsymbol{x}_1, \boldsymbol{y}_1), \ldots, (\boldsymbol{x}_p, \boldsymbol{y}_p)\}

of pairs of n-dimensional and m-dimensional vectors called the input and output patterns in Machine Learning parlance. We wish to build a neural network model using this training set.

A neural network model (or at least the specific model we discuss: the multi-layer perceptron) consists of a sequence of transformations. The first transformation creates weighted sums of the inputs.

\displaystyle  a_j^{(1)} = \sum_{i=1}^{K_0} w^{(1)}_{ij}x_i + w_{0j}^{(1)}

where K_0 \equiv n is the size of the input vector and there are j = 1,\ldots,K_1 neurons in the so-called first hidden layer of the network. The weights are unknown.

The second transformation then applies a non-linear activation function f to each a_j to give the output from the j-th neuron in the first hidden layer.

\displaystyle  z_j^{(1)} = f(a_j^{(1)})

Typically, f is chosen to be \tanh or the logistic function. Note that if it were chosen to be the identity then our neural network would be the same as a multivariate linear logistic regression.

We now repeat these steps for the second hidden layer:

\displaystyle  \begin{aligned}  a_j^{(2)} &= \sum_{i=1}^{K_1} w^{(2)}_{ij}z_i^{(1)} + w_{0j}^{(2)} \\  z_j^{(2)} &= f(a_j^{(2)})  \end{aligned}

Ultimately, after we have applied L-1 such transformations (through L-1 hidden layers), we produce some output:

\displaystyle  \hat{y}_j = a_j^{(L)} = \sum_{i=1}^{K_{L-1}} w^{(L)}_{ij}z_i^{(L-1)} + w_{0j}^{(L)}

We show an example neural network in the diagram below.

The input layer has 7 nodes. There are 2 hidden layers, the first has 3 nodes and the second has 5. The output layer has 3 nodes.

We are also given a cost function:

\displaystyle  E(\boldsymbol{w}; \boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|(\hat{\boldsymbol{y}} - \boldsymbol{y})\|^2

where \hat{\boldsymbol{y}} is the predicted output of the neural net and \boldsymbol{y} is the observed output.

As with logistic regression, our goal is to find weights for the neural network which minimise this cost function. We initialise the weights to some small non-zero amount and then use the method of steepest descent (aka gradient descent). The idea is that if f is a function of several variables then to find its minimum value, one ought to take a small step in the direction in which it is decreasing most quickly and repeat until no step in any direction results in a decrease. The analogy is that if one is walking in the mountains then the quickest way down is to walk in the direction which goes down most steeply. Of course one can get stuck at a local minimum rather than the global minimum, but from a machine learning point of view this may be acceptable; alternatively one may start at random points in the search space and check they all give the same minimum.

We therefore need to calculate the gradient of the loss function with respect to the weights (since we need to minimise the cost function). In other words we need to find:

\displaystyle  \nabla E(\boldsymbol{w}) \equiv \bigg(\frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\bigg)

Once we have this we can take our random starting position and move down the steepest gradient:

\displaystyle  w'_i = w_i - \gamma\frac{\partial E}{\partial w_i}

where \gamma is the step length known in machine learning parlance as the learning rate.

Haskell Foreword

Some pragmas and imports required for the example code.

> {-# LANGUAGE RankNTypes                #-}
> {-# LANGUAGE DeriveFunctor             #-}
> {-# LANGUAGE DeriveFoldable            #-}
> {-# LANGUAGE DeriveTraversable         #-}
> {-# LANGUAGE ScopedTypeVariables       #-}
> {-# LANGUAGE TupleSections             #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}

> module NeuralNet
>        ( test1
>        , test2
>        , test3
>        ) where

> import Numeric.AD
> import Numeric.AD.Types

> import Data.Traversable (Traversable)
> import Data.Foldable (Foldable)
> import Data.List
> import Data.List.Split
> import System.Random
> import qualified Data.Vector as V

> import Control.Monad
> import Control.Monad.State

> import Data.Random ()
> import Data.Random.Distribution.Beta
> import Data.Random.Distribution.Uniform
> import Data.RVar

> import Text.Printf

Logistic Regression Redux

Let us first implement logistic regression. This will give us a reference against which to compare the equivalent solution expressed as a neural network.

Instead of maximizing the log likelihood, we will minimize a cost function.

> cost :: Floating a => V.Vector a -> a -> V.Vector a -> a
> cost theta y x = 0.5 * (y - yhat)^2
>   where
>     yhat = logit $ V.sum $ V.zipWith (*) theta x

> logit :: Floating a =>
>          a -> a
> logit x = 1 / (1 + exp (negate x))

We add a regularization term into the total cost so that the parameters do not grow too large. Note that we do not regularize over the bias.

> delta :: Floating a => a
> delta = 0.01

> totalCost :: Floating a =>
>              V.Vector a ->
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              a
> totalCost theta y x = (a + delta * b) / l
>   where
>     l = fromIntegral $ V.length y
>     a = V.sum $ V.zipWith (cost theta) y x
>     b = (/2) $ V.sum $ V.map (^2) $ V.drop 1 theta

We determine the gradient of the regularized cost function.

> delTotalCost :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalCost y x = grad f
>   where
>     f theta = totalCost theta (V.map auto y) (V.map (V.map auto) x)

And finally we can apply gradient descent.

> gamma :: Double
> gamma = 0.4

> stepOnceCost :: Floating a =>
>                  a ->
>                  V.Vector a ->
>                  V.Vector (V.Vector a) ->
>                  V.Vector a ->
>                  V.Vector a
> stepOnceCost gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>     where
>       del = delTotalCost y x

Neural Network Representation

Let us borrow, generalize and prune the data structures used in “A Functional Approach to Neural Networks”. Some of the fields in the borrowed data structures are probably no longer necessary given that we are going to use automated differentiation rather than backpropagation. Caveat lector!

The activation function itself is a function which takes any type in the Floating class to the same type in the Floating class e.g. Double.

> newtype ActivationFunction =
>   ActivationFunction
>   {
>     activationFunction :: Floating a => a -> a
>   }
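For example, a \tanh activation can be wrapped as follows (tanhAF is a name we introduce for illustration only; the examples below use the logistic activation).

> tanhAF :: ActivationFunction
> tanhAF = ActivationFunction tanh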

A neural network is a collection of layers.

> data Layer a =
>   Layer
>   {
>     layerWeights  :: [[a]],
>     layerFunction :: ActivationFunction
>   } deriving (Functor, Foldable, Traversable)

> data BackpropNet a = BackpropNet
>     {
>       layers       :: [Layer a],
>       learningRate :: Double
>     } deriving (Functor, Foldable, Traversable)

We need some helper functions to build our neural network and to extract information from it.

> buildBackpropNet ::
>   Double ->
>   [[[a]]] ->
>   ActivationFunction ->
>   BackpropNet a
> buildBackpropNet learningRate ws f =
>   BackpropNet {
>       layers       = map buildLayer checkedWeights
>     , learningRate = learningRate
>     }
>   where checkedWeights = scanl1 checkDimensions ws
>         buildLayer w   = Layer { layerWeights  = w
>                                 , layerFunction = f
>                                 }
>         checkDimensions :: [[a]] -> [[a]] -> [[a]]
>         checkDimensions w1 w2 =
>           if 1 + length w1 == length (head w2)
>           then w2
>           else error $ "Inconsistent dimensions in weight matrix\n" ++
>                         show (length w1)        ++ "\n" ++
>                         show (length w2)        ++ "\n" ++
>                         show (length $ head w1) ++ "\n" ++
>                         show (length $ head w2)

> extractWeights :: BackpropNet a -> [[[a]]]
> extractWeights x = map layerWeights $ layers x

In order to undertake gradient descent on the data structure in which we store a neural network, BackpropNet, it will be convenient to be able to add such structures together point-wise.

> instance Num a => Num (Layer a) where
>   (+) = addLayer
> 
> addLayer :: Num a => Layer a -> Layer a -> Layer a
> addLayer x y =
>   Layer { layerWeights  = zipWith (zipWith (+))
>                                   (layerWeights x)
>                                   (layerWeights y)
>         , layerFunction = layerFunction x
>         }

> instance Num a => Num (BackpropNet a) where
>   (+) = addBPN

> addBPN :: Num a => BackpropNet a -> BackpropNet a -> BackpropNet a
> addBPN x y = BackpropNet { layers = zipWith (+) (layers x) (layers y)
>                           , learningRate = learningRate x
>                           }

We store information about the output values of each layer in the neural network as we move forward through it (aka forward propagation).

> data PropagatedLayer a
>     = PropagatedLayer
>         {
>           propLayerIn         :: [a],
>           propLayerOut        :: [a],
>           propLayerWeights    :: [[a]],
>           propLayerActFun     :: ActivationFunction
>         }
>     | PropagatedSensorLayer
>         {
>           propLayerOut :: [a]
>         } deriving (Functor, Foldable, Traversable)

Sadly we have to use an inefficient calculation to multiply matrices; see this email for further details.

> matMult :: Num a => [[a]] -> [a] -> [a]
> matMult m v = result
>   where
>     lrs = map length m
>     l   = length v
>     result = if all (== l) lrs
>              then map (\r -> sum $ zipWith (*) r v) m
>              else error $ "Matrix has rows of length " ++ show lrs ++
>                           " but vector is of length " ++ show l
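For example, multiplying a 2 \times 2 matrix by a vector of ones sums each row:

ghci> matMult [[1,2],[3,4]] [1,1]
  [3,7]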

Now we can propagate forwards. Note that the code from which this is borrowed assumes that the inputs are images of m \times m pixels, each encoded in grayscale, hence the references to bits and the check that values lie in the range 0 \leq x \leq 1.

> propagateNet :: (Floating a, Ord a, Show a) =>
>                 [a] ->
>                 BackpropNet a ->
>                 [PropagatedLayer a]
> propagateNet input net = tail calcs
>   where calcs = scanl propagate layer0 (layers net)
>         layer0 = PropagatedSensorLayer $ validateInput net input
> 
>         validateInput net = validateInputValues .
>                             validateInputDimensions net
> 
>         validateInputDimensions net input =
>           if got == expected
>           then input
>           else error ("Input pattern has " ++ show got ++
>                       " bits, but " ++
>                       show expected ++ " were expected")
>           where got      = length input
>                 expected = (+(negate 1)) $
>                            length $
>                            head $
>                            layerWeights $
>                            head $
>                            layers net
> 
>         validateInputValues input =
>           if (minimum input >= 0) && (maximum input <= 1)
>           then input
>           else error "Input bits outside of range [0,1]"

Note that we add a 1 to the inputs to each layer to give the bias.

> propagate :: (Floating a, Show a) =>
>              PropagatedLayer a ->
>              Layer a ->
>              PropagatedLayer a
> propagate layerJ layerK = result
>   where
>     result =
>       PropagatedLayer
>         {
>           propLayerIn         = layerJOut,
>           propLayerOut        = map f a,
>           propLayerWeights    = weights,
>           propLayerActFun     = layerFunction layerK
>         }
>     layerJOut = propLayerOut layerJ
>     weights   = layerWeights layerK
>     a = weights `matMult` (1:layerJOut)
>     f :: Floating a => a -> a
>     f = activationFunction $ layerFunction layerK

> evalNeuralNet :: (Floating a, Ord a, Show a) =>
>                  BackpropNet a -> [a] -> [a]
> evalNeuralNet net input = propLayerOut $ last calcs
>   where calcs = propagateNet input net

We define a cost function.

> costFn :: (Floating a, Ord a, Show a) =>
>           Int ->
>           Int ->
>           [a] ->
>           BackpropNet a ->
>           a
> costFn nDigits expectedDigit input net = 0.5 * sum (map (^2) diffs)
>   where
>     predicted = evalNeuralNet net input
>     diffs = zipWith (-) ((targets nDigits)!!expectedDigit) predicted

> targets :: Floating a => Int -> [[a]]
> targets nDigits = map row [0 .. nDigits - 1]
>   where
>     row m = concat [x, 1.0 : y]
>       where
>         (x, y) = splitAt m (take (nDigits - 1) $ repeat 0.0)
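For example, with three digits the targets are just the one-hot encodings:

ghci> targets 3 :: [[Double]]
  [[1.0,0.0,0.0],[0.0,1.0,0.0],[0.0,0.0,1.0]]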

If we would rather perform gradient descent over the whole training set (rather than stochastically) then we can do so. Note that we do not regularize the weights for the biases.

> totalCostNN :: (Floating a, Ord a, Show a) =>
>                Int ->
>                V.Vector Int ->
>                V.Vector [a] ->
>                BackpropNet a ->
>                a
> totalCostNN nDigits expectedDigits inputs net = cost
>   where
>     cost = (a + delta * b) / l
> 
>     l = fromIntegral $ V.length expectedDigits
> 
>     a = V.sum $ V.zipWith (\expectedDigit input ->
>                             costFn nDigits expectedDigit input net)
>                           expectedDigits inputs
> 
>     b = (/(2 * m)) $ sum $ map (^2) ws
> 
>     m = fromIntegral $ length ws
> 
>     ws = concat $ concat $
>          map stripBias $
>          extractWeights net
> 
>     stripBias xss = map (drop 1) xss

> delTotalCostNN :: (Floating a, Ord a, Show a) =>
>                   Int ->
>                   V.Vector Int ->
>                   V.Vector [a] ->
>                   BackpropNet a ->
>                   BackpropNet a
> delTotalCostNN nDigits expectedDigits inputs = grad f
>   where
>     f net = totalCostNN nDigits expectedDigits
>                         (V.map (map auto) inputs) net

> stepOnceTotal :: Int ->
>                  Double ->
>                  V.Vector Int ->
>                  V.Vector [Double] ->
>                  BackpropNet Double ->
>                  BackpropNet Double
> stepOnceTotal nDigits gamma y x net =
>   net + fmap (* (negate gamma)) (delTotalCostNN nDigits y x net)

Example I

Let’s try it out. First we need to generate some data. Rather arbitrarily let us create some populations from the beta distribution.

> betas :: Int -> Double -> Double -> [Double]
> betas n a b =
>   fst $ runState (replicateM n (sampleRVar (beta a b))) (mkStdGen seed)
>     where
>       seed = 0

We can plot the populations we wish to distinguish by sampling.

> a, b :: Double
> a          = 15
> b          = 6
> nSamples :: Int
> nSamples   = 100000
> 
> sample0, sample1 :: [Double]
> sample0 = betas nSamples a b
> sample1 = betas nSamples b a

> mixSamples :: [Double] -> [Double] -> [(Double, Double)]
> mixSamples xs ys = unfoldr g ((map (0,) xs), (map (1,) ys))
>   where
>     g ([], [])         = Nothing
>     g ([],  _)         = Nothing
>     g ( _, [])         = Nothing
>     g ((x:xs), (y:ys)) = Just $ (x, (y:ys, xs))

> createSample :: V.Vector (Double, Double)
> createSample = V.fromList $ take 100 $ mixSamples sample1 sample0

> lRate :: Double
> lRate = 0.01
> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 1.0]
> initTheta :: V.Vector Double
> initTheta = V.replicate (V.length actualTheta) 0.1

> logitAF :: ActivationFunction
> logitAF = ActivationFunction logit

> test1 :: IO ()
> test1 = do
> 
>   let testNet = buildBackpropNet lRate [[[0.1, 0.1], [0.1, 0.1]]] logitAF

>   let vals :: V.Vector (Double, V.Vector Double)
>       vals = V.map (\(y, x) -> (y, V.fromList [1.0, x])) $ createSample
> 
>   let gs = iterate (stepOnceCost gamma (V.map fst vals) (V.map snd vals))
>                    initTheta
>       theta = head $ drop 1000 gs
>   printf "Logistic regression: theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta V.! 0) (theta V.! 1)
> 
>   let us = V.map (round . fst) createSample
>   let vs = V.map snd createSample
>   let fs = iterate (stepOnceTotal 2 gamma us (V.map return vs)) testNet
>       phi = extractWeights $ head $ drop 1000 fs
>   printf "Neural network: theta_00 = %5.3f, theta_01 = %5.3f\n"
>          (((phi!!0)!!0)!!0) (((phi!!0)!!0)!!1)
>   printf "Neural network: theta_10 = %5.3f, theta_11 = %5.3f\n"
>          (((phi!!0)!!1)!!0) (((phi!!0)!!1)!!1)

ghci> test1
  Logistic regression: theta_0 = -2.383, theta_1 = 4.852
  Neural network: theta_00 = 2.386, theta_01 = -4.861
  Neural network: theta_10 = -2.398, theta_11 = 4.886

Example II

Now let’s try a neural net with 1 hidden layer using the data we prepared earlier.

We seed the weights in the neural net with small random values; if we set all the weights to 0 then the gradient descent algorithm might get stuck.

> uniforms :: Int -> [Double]
> uniforms n =
>   fst $ runState (replicateM n (sampleRVar stdUniform)) (mkStdGen seed)
>     where
>       seed = 0

> randomWeightMatrix :: Int -> Int -> [[Double]]
> randomWeightMatrix numInputs numOutputs = y
>   where
>     y = chunksOf numInputs weights
>     weights = map (/ 100.0) $ uniforms (numOutputs * numInputs)

> w1, w2 :: [[Double]]
> w1  = randomWeightMatrix 2 2
> w2  = randomWeightMatrix 3 2

> initNet2 :: BackpropNet Double
> initNet2 = buildBackpropNet lRate [w1, w2] logitAF
> 
> labels :: V.Vector Int
> labels = V.map (round . fst) createSample

> inputs :: V.Vector [Double]
> inputs = V.map (return . snd) createSample

Instead of hand-crafting gradient descent, let us use the library function gradientDescent as it performs better and saves us the implementation effort.

> estimates :: (Floating a, Ord a, Show a) =>
>              V.Vector Int ->
>              V.Vector [a] ->
>              BackpropNet a ->
>              [BackpropNet a]
> estimates y x = gradientDescent $
>                 \theta -> totalCostNN 2 y (V.map (map auto) x) theta

Now we can examine the weights of our fitted neural net and apply it to some test data.

> test2 :: IO ()
> test2 = do
> 
>   let fs = estimates labels inputs initNet2
>   mapM_ putStrLn $ map (take 60) $
>                    map show $ extractWeights $
>                    head $ drop 1000 fs
>   putStrLn $ show $ evalNeuralNet (head $ drop 1000 fs) [0.1]
>   putStrLn $ show $ evalNeuralNet (head $ drop 1000 fs) [0.9]

ghci> test2
  [[3.3809809537916933,-6.778365921046131],[-5.157492699008754
  [[1.2771246165025043,5.294090869351353,-8.264801192310706],[
  [0.997782987249909,2.216698813392053e-3]
  [1.4853346509852003e-3,0.9985148392767443]

Example III

Let’s try a more sophisticated example and create a population of 4 groups which we measure with 2 variables.

> c, d :: Double
> c          = 15
> d          = 8
> sample2, sample3 :: [Double]
> sample2 = betas nSamples c d
> sample3 = betas nSamples d c

> mixSamples3 :: Num t => [[a]] -> [(t, a)]
> mixSamples3 xss = concat $ transpose $
>                   zipWith (\n xs -> map (n,) xs)
>                           (map fromIntegral [0..])
>                           xss
> sample02, sample03, sample12, sample13 :: [(Double, Double)]
> sample02 = [(x, y) | x <- sample0, y <- sample2]
> sample03 = [(x, y) | x <- sample0, y <- sample3]
> sample12 = [(x, y) | x <- sample1, y <- sample2]
> sample13 = [(x, y) | x <- sample1, y <- sample3]

> createSample3 :: forall t. Num t => V.Vector (t, (Double, Double))
> createSample3 = V.fromList $ take 512 $ mixSamples3 [ sample02
>                                                     , sample03
>                                                     , sample12
>                                                     , sample13
>                                                     ]

Rather annoyingly, picking random weights seemed to give a local rather than the global minimum. This may be a feature of having more nodes in the hidden layer than in the input layer. By fitting a neural net with no hidden layers to the data and using the outputs as inputs to fit another neural net with no hidden layers, we can get a starting point from which we can converge to the global minimum.

> w31, w32 :: [[Double]]
> w31 = [[-1.795626449637491,1.0687662199549477,0.6780994566671094],
>        [-0.8953174631646047,1.536931540024011,-1.7631220370122578],
>        [-0.4762453998497917,-2.005243268058972,1.2945899127545906],
>        [0.43019763097582875,-1.5711869072989957,-1.187180183656747]]
> w32 = [[-0.65116209142284,0.4837310591797774,-0.17870333721054968,
>         -0.6692619856605464,-1.062292154441557],
>        [-0.7521274440366631,-1.2071835415415136e-2,1.0078929981538551,
>         -1.3144243587577473,-0.5102027925579049],
>        [-0.7545728756863981,-0.4830112128458844,-1.2901624541811962,
>         1.0487049495446408,9.746209726152217e-3],
>        [-0.8576212271328413,-0.9035219951783956,-0.4034500456652809,
>         0.10091187689838758,0.781835908789879]]
> 
> testNet3 :: BackpropNet Double
> testNet3 = buildBackpropNet lRate [w31, w32] logitAF

> labels3 :: V.Vector Int
> labels3 = V.map (round . fst) createSample3
> inputs3 :: V.Vector [Double]
> inputs3 = V.map ((\(x, y) -> [x, y]) . snd) createSample3

Now we use the library gradientDescent function to generate neural nets which fit the data ever better.

> estimates3 :: (Floating a, Ord a, Show a) =>
>               V.Vector Int ->
>               V.Vector [a] ->
>               BackpropNet a ->
>               [BackpropNet a]
> estimates3 y x = gradientDescent $
>                  \theta -> totalCostNN 4 y (V.map (map auto) x) theta

Finally we can fit a neural net and check that it correctly classifies some data.

> test3 :: IO ()
> test3 = do
>   let fs = drop 100 $ estimates3 labels3 inputs3 testNet3
>   mapM_ putStrLn $ map (take 60) $ map show $ extractWeights $ head fs
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.1, 0.1]
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.1, 0.9]
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.9, 0.1]
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.9, 0.9]

ghci> test3
  [[-2.295896239599931,2.705409060274802,2.1377566388724047],[
  [[-0.6169787627963551,2.5369568963968256,-0.3515306366626614
  [2.638026636449198e-2,9.091308688841797e-2,0.373349222824566
  [0.13674565454319784,1.128123133092104e-2,0.8525700090804755
  [0.30731134024095474,0.8197492648500939,1.3704140162804749e-
  [0.6773814649389487,0.22533958204471505,0.1957913744022863,4

References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Rojas, R. 1996. Neural networks: a systematic introduction. Springer-Verlag New York Incorporated. http://books.google.co.uk/books?id=txsjjYzFJS4C.

Logistic Regression and Automated Differentiation

Introduction

Having shown how to use automated differentiation to estimate parameters in the case of linear regression let us now turn our attention to the problem of classification. For example, we might have some data about people’s social networking such as volume of twitter interactions and number of twitter followers together with a label which represents a human judgement about which one of the two individuals is more influential. We would like to predict, for a pair of individuals, the human judgement on who is more influential.

Logistic Regression

We define the probability of getting a particular value of the binary label:

\displaystyle   \begin{aligned}  {\mathbb P}(y = 1 \mid \boldsymbol{x}; \boldsymbol{\theta}) &= h_{\boldsymbol{\theta}}(\boldsymbol{x}) \\  {\mathbb P}(y = 0 \mid \boldsymbol{x}; \boldsymbol{\theta}) &= 1 - h_{\boldsymbol{\theta}}(\boldsymbol{x})  \end{aligned}

where \boldsymbol{x} and \boldsymbol{\theta} are column vectors of size m

\displaystyle   h_{\boldsymbol{\theta}}(\boldsymbol{x}) = g(\boldsymbol{\theta}^T\boldsymbol{x})

and g is a function such as the logistic function g(x) = 1 / (1 + e^{-x}) or \tanh.

We can re-write this as:

\displaystyle   p(y \mid \boldsymbol{x} ; \boldsymbol{\theta}) = (h_{\boldsymbol{\theta}}(\boldsymbol{x}))^y(1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}))^{1 - y}

We wish to find the value of \boldsymbol{\theta} that gives the maximum probability to the observations. We do this by maximising the likelihood. Assuming we have n observations the likelihood is:

\displaystyle   \begin{aligned}  \mathcal{L}(\boldsymbol{\theta}) &= \prod_{i=1}^n p(y^{(i)} \mid {\boldsymbol{x}}^{(i)} ; \boldsymbol{\theta}) \\            &= \prod_{i=1}^n (h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))^{y^{(i)}} (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))^{1 - y^{(i)}}  \end{aligned}

It is standard practice to maximise the log likelihood which will give the same maximum as log is monotonic.

\displaystyle   \begin{aligned}  \lambda(\boldsymbol{\theta}) &= \log \mathcal{L}(\boldsymbol{\theta}) \\            &= \sum_{i=1}^n {y^{(i)}}\log h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) + (1 - y^{(i)})\log (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))  \end{aligned}

In order to maximize the log likelihood, we again use the method of steepest ascent: if \boldsymbol{\theta}^i is a guess for the parameters of the model then we can improve the guess by stepping a small distance in the direction of greatest change.

\displaystyle   \boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} + \gamma \nabla\lambda(\boldsymbol{\theta}^{i})

\gamma is some constant known in machine learning as the learning rate. It must be chosen to be large enough to ensure convergence within a reasonable number of steps but not so large that the algorithm fails to converge.

When the number of observations is high, the cost of evaluating the log likelihood and its gradient can also be high; as a cheaper alternative we can use stochastic gradient descent: instead of taking the gradient using all the observations at each step, we take the gradient with respect to a single observation at a time. Of course if our data set is small we may have to cycle through it several times to achieve convergence.
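A single stochastic step might then look like the following sketch (stochasticStep is a name we introduce here, deliberately kept out of the literate code; it reuses the per-observation logLikelihood defined in the implementation below and mirrors the shape of stepOnce):

-- one stochastic-ascent update using a single observation (y, x)
stochasticStep :: Double ->
                  Double ->
                  V.Vector Double ->
                  V.Vector Double ->
                  V.Vector Double
stochasticStep gamma y x theta =
  V.zipWith (+) theta (V.map (* gamma) (del theta))
  where
    del = grad (\th -> logLikelihood th (auto y) (V.map auto x))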

When the observations / training data are linearly separable then the magnitude of the parameters can grow without bound as the (parametrized) logistic function then tends to the Heaviside / step function. Moreover, it is obvious that there can be more than one separating hyperplane in this circumstance. To circumvent these infelicities, we instead maximize a penalized log likelihood function:

\displaystyle   \sum_{i=1}^n {y^{(i)}}\log h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) + (1 - y^{(i)})\log (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})) - \frac{\delta}{2}\|\boldsymbol{\theta}\|^2
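
Differentiating the penalty contributes an extra term -\delta\boldsymbol{\theta} to the gradient, shrinking the parameters towards zero and keeping them bounded.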

See Bishop and Mitchell for further details.

Implementation

Some pragmas to warn us about potentially dangerous situations.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults  #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> 
> {-# LANGUAGE TupleSections #-}
> 
> module Logistic ( betas
>                 , main
>                 , a
>                 , b
>                 , nSamples
>                 ) where

Modules from the automatic differentiation library, together with some standard modules for vectors, monads, lists and formatted output.

> import Numeric.AD
> import Numeric.AD.Types
> import qualified Data.Vector as V
> import Control.Monad
> import Control.Monad.State
> import Data.List
> import Text.Printf

Some modules from a random number generator library as we will want to generate some test data.

> import System.Random
> import Data.Random ()
> import Data.Random.Distribution.Beta
> import Data.RVar

Our model: the probability that y has the label 1 given the observations \boldsymbol{x}. (Note that, its name notwithstanding, the function logit below is the logistic function, i.e. the inverse of the logit.)

> logit :: Floating a =>
>          a -> a
> logit x = 1 / (1 + exp (negate x))

For each observation, the log likelihood:

> logLikelihood :: Floating a => V.Vector a -> a -> V.Vector a -> a
> logLikelihood theta y x = y * log (logit z) +
>                           (1 - y) * log (1 - logit z)
>   where
>     z = V.sum $ V.zipWith (*) theta x
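
The log likelihood of all the observations, including the penalty term and averaged over the number of observations (dividing by a positive constant does not move the maximum):
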
> totalLogLikelihood :: Floating a =>
>                       V.Vector a ->
>                       V.Vector a ->
>                       V.Vector (V.Vector a) ->
>                       a
> totalLogLikelihood theta y x = (a - delta * b) / l
>   where
>     l = fromIntegral $ V.length y
>     a = V.sum $ V.zipWith (logLikelihood theta) y x
>     b = (/2) $ V.sum $ V.map (^2) theta

As before we can implement steepest ascent using the grad function.

> delTotalLogLikelihood :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalLogLikelihood y x = grad f
>   where
>     f theta = totalLogLikelihood theta
>                                  (V.map auto y)
>                                  (V.map (V.map auto) x)
> 
> stepOnce :: Double ->
>             V.Vector Double ->
>             V.Vector (V.Vector Double) ->
>             V.Vector Double ->
>             V.Vector Double
> stepOnce gamma y x theta =
>   V.zipWith (+) theta (V.map (* gamma) $ del theta)
>   where
>     del = delTotalLogLikelihood y x
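
We could also implement the stochastic variant described above, stepping with the gradient of a single observation’s log likelihood. A minimal sketch (delLogLikelihood and this stepOnceStoch are illustrative additions, not used in main below, and omit the penalty term):

> delLogLikelihood :: Floating a =>
>                     a ->
>                     V.Vector a ->
>                     V.Vector a ->
>                     V.Vector a
> delLogLikelihood y x = grad f
>   where
>     f theta = logLikelihood theta (auto y) (V.map auto x)
> 
> -- A single stochastic ascent step (penalty term omitted).
> stepOnceStoch :: Double ->
>                  Double ->
>                  V.Vector Double ->
>                  V.Vector Double ->
>                  V.Vector Double
> stepOnceStoch gamma y x theta =
>   V.zipWith (+) theta (V.map (* gamma) $ del theta)
>   where
>     del = delLogLikelihood y x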

Or, even easier, we can just use the library function gradientAscent!

> estimates :: (Floating a, Ord a) =>
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              V.Vector a ->
>              [V.Vector a]
> estimates y x = gradientAscent $
>                 \theta -> totalLogLikelihood theta
>                                              (V.map auto y)
>                                              (V.map (V.map auto) x)

Let’s try it out. First we need to generate some data. Rather arbitrarily let us create some populations from the beta distribution.

> betas :: Int -> Double -> Double -> [Double]
> betas n a b =
>   evalState (replicateM n (sampleRVar (beta a b))) (mkStdGen seed)
>     where
>       seed = 0

We can plot histograms of the two populations we wish to distinguish by sampling them.

> a, b :: Double
> a          = 15
> b          = 6
> nSamples :: Int
> nSamples   = 100000
> 
> sample0, sample1 :: [Double]
> sample0 = betas nSamples a b
> sample1 = betas nSamples b a
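
The two populations have means a / (a + b) \approx 0.71 and b / (a + b) \approx 0.29 respectively, each with a standard deviation of roughly 0.1.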

Note that in this case we could come up with a classification rule by inspecting the histograms. Furthermore, the populations overlap which means we will inevitably mis-classify some observations.

> mixSamples :: [Double] -> [Double] -> [(Double, Double)]
> mixSamples xs ys = unfoldr g ((map (0,) xs), (map (1,) ys))
>   where
>     g ([], [])         = Nothing
>     g ([],  _)         = Nothing
>     g ( _, [])         = Nothing
>     g ((x:xs), (y:ys)) = Just $ (x, (y:ys, xs))
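
We construct our training set by taking 100 observations, alternating between the two populations.
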
> createSample :: V.Vector (Double, Double)
> createSample = V.fromList $ take 100 $ mixSamples sample1 sample0

We create a model with one independent variable and thus two parameters.

> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 1.0]

We initialise our algorithm with arbitrary values.

> initTheta :: V.Vector Double
> initTheta = V.replicate (V.length actualTheta) 0.1

Set the learning rate, the strength of the penalty term and the number of iterations.

> gamma :: Double
> gamma = 0.1
> 
> delta :: Floating a => a
> delta = 1.0
> 
> nIters :: Int
> nIters = 4000

Now we can run our example. For the constant parameter of our model (known in machine learning as the bias) we ensure that the corresponding “independent variable” is always set to 1.0.

> vals :: V.Vector (Double, V.Vector Double)
> vals = V.map (\(y, x) -> (y, V.fromList [1.0, x])) $ createSample
> main :: IO ()
> main = do
>   let u = V.map fst vals
>       v = V.map snd vals
>       hs = iterate (stepOnce gamma u v) initTheta
>       theta  = head $ drop nIters hs
>       theta' = head $ drop 100 $ estimates u v initTheta
>   printf "Hand crafted descent: theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta V.! 0) (theta V.! 1)
>   printf "Library descent:      theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta' V.! 0) (theta' V.! 1)
>   let predProbs  = V.map (\x -> logit $ V.sum $ V.zipWith (*) theta x) v
>       mismatches = V.filter (> 0.5) $
>                    V.map abs $
>                    V.zipWith (-) actuals preds
>         where
>           actuals = u
>           preds   = V.map (\x -> fromIntegral $ fromEnum (x > 0.5))
>                           predProbs
>   let lActuals, lMisMatches :: Double
>       lActuals    = fromIntegral $ V.length vals
>       lMisMatches = fromIntegral $ V.length mismatches
>   printf "%5.2f%% correct\n" $
>          100.0 *  (lActuals - lMisMatches) / lActuals

And we get quite reasonable estimates:

ghci> main
  Hand crafted descent: theta_0 = -2.071, theta_1 = 4.318
  Library descent:      theta_0 = -2.070, theta_1 = 4.319
  97.00% correct

Regression and Automated Differentiation

Introduction

Automated differentiation was developed in the 1960s but even now does not seem to be widely used. Even experienced and knowledgeable practitioners often assume it is either a finite difference method or symbolic computation when it is neither.

This article gives a very simple application of it in a machine learning / statistics context.

Multivariate Linear Regression

We model a dependent variable as a linear function of some set of independent variables, observed in a noisy environment.

\displaystyle   y^{(i)} = \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)} + \epsilon^{(i)}

where

  • i runs from 1 to n, the number of observations;

  • \epsilon^{(i)} are i.i.d. normal with mean 0 and the same variance \sigma^2: \epsilon^{(i)} \sim \mathcal{N} (0,\sigma^2);

  • For each i, \boldsymbol{x}^{(i)} is a column vector of size m and

  • \boldsymbol{\theta} is a column vector also of size m.

In other words:

\displaystyle   p(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}) =  \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big)

We can therefore write the likelihood function given all the observations as:

\displaystyle   \mathcal{L}(\boldsymbol{\theta}; X, \boldsymbol{y}) =  \prod_{i = 1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big)

In order to find the best fitting parameters \boldsymbol{\theta} we therefore need to maximize this function with respect to \boldsymbol{\theta}. The standard approach is to maximize the log likelihood which, since log is monotonic, will give the same result.

\displaystyle   \begin{aligned}  \lambda(\boldsymbol{\theta}) &= \log \mathcal{L}(\boldsymbol{\theta}) \\                               &= \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big) \\                               &= n\log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2  \end{aligned}

Hence maximizing the likelihood is the same as minimizing the (biased) estimate of the variance:

\displaystyle   \frac{1}{n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2

We can define a cost function:

\displaystyle   \mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2

Clearly minimizing this will give the same result. The constant 1/2 is there to simplify the manipulation of the derivative. In our case this is irrelevant, as we are not going to compute the derivative by hand but use automated differentiation.
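
For reference, the gradient that automated differentiation will compute for us is

\displaystyle   \nabla\mathcal{J}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^n \big(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)}\big)\boldsymbol{x}^{(i)}

in which the factor of 2 from differentiating the square cancels the constant 1/2.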

In order to minimize the cost function, we use the method of steepest descent: if \boldsymbol{\theta}^i is a guess for the parameters of the model then we can improve the guess by stepping a small distance against the gradient, the direction of greatest decrease.

\displaystyle   \boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} - \gamma \nabla\mathcal{J}(\boldsymbol{\theta})

\gamma is some constant known in machine learning as the learning rate. It must be chosen to be large enough to ensure convergence within a reasonable number of steps but not so large that the algorithm fails to converge.

When the number of observations is high then the cost of evaluating the cost function can be high; as a cheaper alternative we can use stochastic gradient descent. Instead of taking the gradient with respect to all the observations, we take the gradient with respect to each observation in our data set. Of course if our data set is small we may have to use the data set several times to achieve convergence.

Implementation

Some pragmas to warn us about potentially dangerous situations.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults  #-}
> module Linear where

Modules from the automatic differentiation library, together with a vector library.

> import Numeric.AD
> import Numeric.AD.Types
> import qualified Data.Vector as V

Some modules from a random number generator library as we will want to generate some test data.

> import Data.Random ()
> import Data.Random.Distribution.Normal
> import Data.Random.Distribution.Uniform
> import Data.RVar

Our model: the predicted value of y is \hat{y} given the observations \boldsymbol{x}.

> yhat :: Floating a =>
>         V.Vector a ->
>         V.Vector a -> a
> yhat x theta = V.sum $ V.zipWith (*) theta x

For each observation, the “cost” of the difference between the actual value of y and its predicted value.

> cost :: Floating a =>
>         V.Vector a ->
>         a ->
>         V.Vector a ->
>         a
> cost theta y x = 0.5 * (y - yhat x theta)^2

To find its gradient we merely apply the operator grad.

> delCost :: Floating a =>
>            a ->
>            V.Vector a ->
>            V.Vector a ->
>            V.Vector a
> delCost y x = grad $ \theta -> cost theta (auto y) (V.map auto x)
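
As a quick sanity check: with \boldsymbol{\theta} = (1, 1), y = 1 and \boldsymbol{x} = (1, 1) we have \hat{y} = 2, a cost of \frac{1}{2}(1 - 2)^2 = \frac{1}{2}, and a gradient of -(y - \hat{y})\boldsymbol{x} = (1, 1); so delCost 1 (V.fromList [1, 1]) (V.fromList [1, 1]) should give fromList [1.0,1.0].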

We can use the single observation cost function to define the total cost function.

> totalCost :: Floating a =>
>              V.Vector a ->
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              a
> totalCost theta y x = (/l) $ V.sum $ V.zipWith (cost theta) y x
>   where
>     l = fromIntegral $ V.length y

Again taking the derivative is straightforward.

> delTotalCost :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalCost y x = grad f
>   where
>     f theta = totalCost theta (V.map auto y) (V.map (V.map auto) x)

Now we can implement steepest descent.

> stepOnce :: Double ->
>             V.Vector Double ->
>             V.Vector (V.Vector Double) ->
>             V.Vector Double ->
>             V.Vector Double
> stepOnce gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>   where
>     del = delTotalCost y x
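
And similarly a single step of stochastic gradient descent, using the gradient of the cost of just one observation:
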
> stepOnceStoch :: Double ->
>                  Double ->
>                  V.Vector Double ->
>                  V.Vector Double ->
>                  V.Vector Double
> stepOnceStoch gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>   where
>     del = delCost y x

Let’s try it out. First we need to generate some data.

> createSample :: Double -> V.Vector Double -> IO (Double, V.Vector Double)
> createSample sigma2 theta = do
>   let l = V.length theta
>   x <- V.sequence $ V.replicate (l - 1) $ sampleRVar stdUniform
>   let mu = (theta V.! 0) + yhat x (V.drop 1 theta)
>   y <- sampleRVar $ normal mu sigma2
>   return (y, x)

We create a model with two independent variables and thus three parameters.

> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 0.6, 0.7]
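
In other words, the data will be generated as y^{(i)} = 0.6 x^{(i)}_1 + 0.7 x^{(i)}_2 + \epsilon^{(i)}, with a zero constant term.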

We initialise our algorithm with arbitrary values.

> initTheta :: V.Vector Double
> initTheta = V.replicate 3 0.1

We give our model an arbitrary variance.

> sigma2 :: Double
> sigma2 = 0.01

And set the learning rate and the number of iterations.

> nSamples, nIters :: Int
> nSamples = 100
> nIters = 2000
> gamma :: Double
> gamma = 0.1

Now we can run our example. For the constant parameter of our model (known in machine learning as the bias) we ensure that the corresponding “independent variable” is always set to 1.0.

> main :: IO ()
> main = do
>   vals <- V.sequence $
>           V.replicate nSamples $
>           createSample sigma2 actualTheta
>   let y = V.map fst vals
>       x = V.map snd vals
>       x' = V.map (V.cons 1.0) x
>       hs = iterate (stepOnce gamma y x') initTheta
>       -- One pass of stochastic descent over the whole data set.
>       update theta = V.foldl f theta $ V.zip y x'
>         where
>           f theta (y, x) = stepOnceStoch gamma y x theta
>   putStrLn $ show $ take 1 $ drop nIters hs
>   -- Compose nSamples passes of stochastic descent over the data.
>   let f = foldr (.) id $ replicate nSamples update
>   putStrLn $ show $ f initTheta

And we get quite reasonable estimates for the parameters.

ghci> main
  [fromList [-8.34526742975572e-4,0.6024033722648041,0.69483650585735]]
  fromList [7.095387239884274e-5,0.6017197904731632,0.694335002002961]