Haskell Vectors and Sampling from a Categorical Distribution

Introduction

Suppose we have a vector of weights which sum to 1.0 and we wish to sample n samples randomly according to these weights. There is a well known trick in Matlab / Octave using sampling from a uniform distribution.

num_particles = 2*10^7
likelihood = zeros(num_particles,1);
likelihood(:,1) = 1/num_particles;
[_,index] = histc(rand(num_particles,1),[0;cumsum(likelihood/sum(likelihood))]);
s = sum(index);

Using tic and toc this produces an answer with

Elapsed time is 10.7763 seconds.

Haskell

I could find no equivalent function in Haskell nor could I easily find a binary search function.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE BangPatterns                 #-}
> import System.Random.MWC
> import qualified Data.Vector.Unboxed as V
> import Control.Monad.ST
> import qualified Data.Vector.Algorithms.Search as S
> import Data.Bits
> n :: Int
> n = 2*10^7

Let’s create some random data. For a change let’s use mwc-random rather than random-fu.

> vs  :: V.Vector Double
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))

Again, I could find no equivalent of cumsum but we can write our own.

> weightsV, cumSumWeightsV :: V.Vector Double
> weightsV = V.replicate n (recip $ fromIntegral n)
> cumSumWeightsV = V.scanl (+) 0 weightsV

Binary search on a sorted vector is straightforward and a cumulative sum ensures that the vector is sorted.

> binarySearch :: (V.Unbox a, Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs

To see how well this performs, let’s sum the indices (of course, we wouldn’t do this in practice) as we did for the Matlab implementation.

> js :: V.Vector Int
> js = indices (V.tail cumSumWeightsV) vs
> main :: IO ()
> main = do
>   print $ V.foldl' (+) 0 js

Using +RTS -s we get

Total   time   10.80s  ( 11.06s elapsed)

which is almost the same as the Matlab version.

I did eventually find a binary search function in vector-algorithms and since one should not re-invent the wheel, let us try using it.

> indices' :: (V.Unbox a, Ord a) => V.Vector a -> V.Vector a -> V.Vector Int
> indices' sv x = runST $ do
>   st <- V.unsafeThaw (V.tail sv)
>   V.mapM (S.binarySearch st) x
> main' :: IO ()
> main' = do
>   print $  V.foldl' (+) 0 $ indices' cumSumWeightsV vs

Again using +RTS -s we get

Total   time   11.34s  ( 11.73s elapsed)

So the library version seems very slightly slower.

Posted in Haskell, Probability | 8 Comments

Importance Sampling

Importance Sampling

Suppose we have an random variable X with pdf 1/2\exp{-\lvert x\rvert} and we wish to find its second moment numerically. However, the random-fu package does not support sampling from such as distribution. We notice that

\displaystyle   \int_{-\infty}^\infty x^2 \frac{1}{2} \exp{-\lvert x\rvert} \mathrm{d}x =  \int_{-\infty}^\infty x^2 \frac{\frac{1}{2} \exp{-\lvert x\rvert}}                                 {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}}                        \frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}  \,\mathrm{d}x

So we can sample from {\cal{N}}(0, 4) and evaluate

\displaystyle   x^2 \frac{\frac{1}{2} \exp{-\lvert x\rvert}}           {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}}

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> module Importance where
> import Control.Monad
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Random.Distribution.Binomial
> import Data.Random.Distribution.Beta
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> sampleImportance :: RVarT (W.Writer [Double]) ()
> sampleImportance = do
>   x <- rvarT $ Normal 0.0 2.0
>   let x2 = x^2
>       u = x2 * 0.5 * exp (-(abs x))
>       v = (exp ((-x2)/8)) * (recip (sqrt (8*pi)))
>       w = u / v
>   lift $ W.tell [w]
>   return ()
> runImportance :: Int -> [Double]
> runImportance n =
>   snd $
>   W.runWriter $
>   evalStateT (sample (replicateM n sampleImportance))
>              (pureMT 2)

We can run this 10,000 times to get an estimate.

ghci> import Formatting
ghci> format (fixed 2) (sum (runImportance 10000) / 10000)
  "2.03"

Since we know that the n-th moment of the exponential distribution is n! / \lambda^n where \lambda is the rate (1 in this example), the exact answer is 2 which is not too far from our estimate using importance sampling.

The value of

\displaystyle   w(x) = \frac{1}{N}\frac{\frac{1}{2} \exp{-\lvert x\rvert}}                         {\frac{1}{\sqrt{8\pi}}{\exp{-x^2/8}}}       = \frac{p(x)}{\pi(x)}

is called the weight, p is the pdf from which we wish to sample and \pi is the pdf of the importance distribution.

Importance Sampling Approximation of the Posterior

Suppose that the posterior distribution of a model in which we are interested has a complicated functional form and that we therefore wish to approximate it in some way. First assume that we wish to calculate the expectation of some arbitrary function f of the parameters.

\displaystyle   {\mathbb{E}}(f({x}) \,\vert\, y_1, \ldots y_T) =  \int_\Omega f({x}) p({x} \, \vert \, y_1, \ldots y_T) \,\mathrm{d}{x}

Using Bayes

\displaystyle   \int_\Omega f({x}) {p\left(x \,\vert\, y_1, \ldots y_T\right)} \,\mathrm{d}{x} =  \frac{1}{Z}\int_\Omega f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x) \,\mathrm{d}{x}

where Z is some normalizing constant.

As before we can re-write this using a proposal distribution \pi(x)

\displaystyle   \frac{1}{Z}\int_\Omega f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x) \,\mathrm{d}{x} =  \frac{1}{Z}\int_\Omega \frac{f({x}) {p\left(y_1, \ldots y_T \,\vert\, x\right)}p(x)}{\pi(x)}\pi(x) \,\mathrm{d}{x}

We can now sample X^{(i)} \sim \pi({x}) repeatedly to obtain

\displaystyle   {\mathbb{E}}(f({x}) \,\vert\, y_1, \ldots y_T) \approx \frac{1}{ZN}\sum_1^N  f({X^{(i)}}) \frac{p(y_1, \ldots y_T \, \vert \, {X^{(i)}})p({X^{(i)}})}                              {\pi({X^{(i)}})} =  \sum_1^N w_if({X^{(i)}})

where the weights w_i are defined as before by

\displaystyle   w_i = \frac{1}{ZN} \frac{p(y_1, \ldots y_T \, \vert \, {X^{(i)}})p({X^{(i)}})}                          {\pi({X^{(i)}})}

We follow Alex Cook and use the example from (Rerks-Ngarm et al. 2009). We take the prior as \sim {\cal{Be}}(1,1) and use {\cal{U}}(0.0,1.0) as the proposal distribution. In this case the proposal and the prior are identical just expressed differently and therefore cancel.

Note that we use the log of the pdf in our calculations otherwise we suffer from (silent) underflow, e.g.,

ghci> pdf (Binomial nv (0.4 :: Double)) xv
  0.0

On the other hand if we use the log pdf form

ghci> logPdf (Binomial nv (0.4 :: Double)) xv
  -3900.8941170876574
> xv, nv :: Int
> xv = 51
> nv = 8197
> sampleUniform :: RVarT (W.Writer [Double]) ()
> sampleUniform = do
>   x <- rvarT StdUniform
>   lift $ W.tell [x]
>   return ()
> runSampler :: RVarT (W.Writer [Double]) () ->
>               Int -> Int -> [Double]
> runSampler sampler seed n =
>   snd $
>   W.runWriter $
>   evalStateT (sample (replicateM n sampler))
>              (pureMT (fromIntegral seed))
> sampleSize :: Int
> sampleSize = 1000
> pv :: [Double]
> pv = runSampler sampleUniform 2 sampleSize
> logWeightsRaw :: [Double]
> logWeightsRaw = map (\p -> logPdf (Beta 1.0 1.0) p +
>                            logPdf (Binomial nv p) xv -
>                            logPdf StdUniform p) pv
> logWeightsMax :: Double
> logWeightsMax = maximum logWeightsRaw
> 
> weightsRaw :: [Double]
> weightsRaw = map (\w -> exp (w - logWeightsMax)) logWeightsRaw
> weightsSum :: Double
> weightsSum = sum weightsRaw
> weights :: [Double]
> weights = map (/ weightsSum) weightsRaw
> meanPv :: Double
> meanPv = sum $ zipWith (*) pv weights
> 
> meanPv2 :: Double
> meanPv2 = sum $ zipWith (\p w -> p * p * w) pv weights
> 
> varPv :: Double
> varPv = meanPv2 - meanPv * meanPv

We get the answer

ghci> meanPv
  6.400869727227364e-3

But if we look at the size of the weights and the effective sample size

ghci> length $ filter (>= 1e-6) weights
  9

ghci> (sum weights)^2 / (sum $ map (^2) weights)
  4.581078458313967

so we may not be getting a very good estimate. Let’s try

> sampleNormal :: RVarT (W.Writer [Double]) ()
> sampleNormal = do
>   x <- rvarT $ Normal meanPv (sqrt varPv)
>   lift $ W.tell [x]
>   return ()
> pvC :: [Double]
> pvC = runSampler sampleNormal 3 sampleSize
> logWeightsRawC :: [Double]
> logWeightsRawC = map (\p -> logPdf (Beta 1.0 1.0) p +
>                             logPdf (Binomial nv p) xv -
>                             logPdf (Normal meanPv (sqrt varPv)) p) pvC
> logWeightsMaxC :: Double
> logWeightsMaxC = maximum logWeightsRawC
> 
> weightsRawC :: [Double]
> weightsRawC = map (\w -> exp (w - logWeightsMaxC)) logWeightsRawC
> weightsSumC :: Double
> weightsSumC = sum weightsRawC
> weightsC :: [Double]
> weightsC = map (/ weightsSumC) weightsRawC
> meanPvC :: Double
> meanPvC = sum $ zipWith (*) pvC weightsC
> meanPvC2 :: Double
> meanPvC2 = sum $ zipWith (\p w -> p * p * w) pvC weightsC
> 
> varPvC :: Double
> varPvC = meanPvC2 - meanPvC * meanPvC

Now the weights and the effective size are more re-assuring

ghci> length $ filter (>= 1e-6) weightsC
  1000

ghci> (sum weightsC)^2 / (sum $ map (^2) weightsC)
  967.113872888872

And we can take more confidence in the estimate

ghci> meanPvC
  6.371225269833208e-3

Bibliography

Rerks-Ngarm, Supachai, Punnee Pitisuttithum, Sorachai Nitayaphan, Jaranit Kaewkungwal, Joseph Chiu, Robert Paris, Nakorn Premsri, et al. 2009. “Vaccination with ALVAC and AIDSVAX to Prevent HIV-1 Infection in Thailand.” New England Journal of Medicine 361 (23) (December 3): 2209–2220. doi:10.1056/nejmoa0908492. http://dx.doi.org/10.1056/nejmoa0908492.

Posted in Bayesian, Haskell, Statistics | Leave a comment

Fun with (Kalman) Filters Part II

Introduction

Suppose we have particle moving in at constant velocity in 1 dimension, where the velocity is sampled from a distribution. We can observe the position of the particle at fixed intervals and we wish to estimate its initial velocity. For generality, let us assume that the positions and the velocities can be perturbed at each interval and that our measurements are noisy.

A point of Haskell interest: using type level literals caught a bug in the mathematical description (one of the dimensions of a matrix was incorrect). Of course, this would have become apparent at run-time but proof checking of this nature is surely the future for mathematicians. One could conceive of writing an implementation of an algorithm or proof, compiling it but never actually running it purely to check that some aspects of the algorithm or proof are correct.

The Mathematical Model

We take the position as x_i and the velocity v_i:

\displaystyle   \begin{aligned}  x_i &= x_{i-1} + \Delta T v_{i-1} + \psi^{(x)}_i \\  v_i &= v_{i-1} + \psi^{(v)}_i \\  y_i &= a_i x_i + \upsilon_i  \end{aligned}

where \psi^{(x)}_i, \psi^{(v)}_i and \upsilon_i are all IID normal with means of 0 and variances of \sigma^2_x, \sigma^2_v and \sigma^2_y

We can re-write this as

\displaystyle   \begin{aligned}  \boldsymbol{x}_i &= \boldsymbol{A}_{i-1}\boldsymbol{x}_{i-1} + \boldsymbol{\psi}_{i-1} \\  \boldsymbol{y}_i &= \boldsymbol{H}_i\boldsymbol{x}_i + \boldsymbol{\upsilon}_i  \end{aligned}

where

\displaystyle   \boldsymbol{A}_i =    \begin{bmatrix}      1 & \Delta T\\      0 & 1\\    \end{bmatrix}  ,\quad  \boldsymbol{H}_i =    \begin{bmatrix}      a_i & 0 \\    \end{bmatrix}  ,\quad  \boldsymbol{\psi}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(x)}_i\big)  ,\quad  \boldsymbol{\Sigma}^{(x)}_i =    \begin{bmatrix}      \sigma^2_{x} & 0\\      0 & \sigma^2_{v} \\    \end{bmatrix}  ,\quad  \boldsymbol{\upsilon}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(y)}_i\big)  ,\quad  \boldsymbol{\Sigma}^{(y)}_i =    \begin{bmatrix}      \sigma^2_{z} \\    \end{bmatrix}

Let us denote the mean and variance of \boldsymbol{X}_i\,\vert\,\boldsymbol{Y}_{i-1} as \hat{\boldsymbol{x}}^\flat_i and \hat{\boldsymbol{\Sigma}}^\flat_i respectively and note that

\displaystyle   \begin{aligned}  {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}  \end{aligned}

Since {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} and {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} are jointly Gaussian and recalling that ({\hat{\boldsymbol{\Sigma}}^\flat_i})^\top = \hat{\boldsymbol{\Sigma}}^\flat_i as covariance matrices are symmetric, we can calculate their mean and covariance matrix as

\displaystyle   \begin{bmatrix}      \hat{\boldsymbol{x}}^\flat_i \\      \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i  \end{bmatrix}  ,\quad  \begin{bmatrix}      \hat{\boldsymbol{\Sigma}}^\flat_i & \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top \\       \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i & \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \end{bmatrix}

We can now use standard formulæ which say if

\displaystyle   \begin{bmatrix}      \boldsymbol{X} \\      \boldsymbol{Y}  \end{bmatrix}  \sim  {\cal{N}}  \begin{bmatrix}  \begin{bmatrix}      \boldsymbol{\mu}_x \\      \boldsymbol{\mu}_y  \end{bmatrix}  &  ,  &  \begin{bmatrix}      \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\      \boldsymbol{\Sigma}^\top_{xy} & \boldsymbol{\Sigma}_y  \end{bmatrix}  \end{bmatrix}

then

\displaystyle   \boldsymbol{X}\,\vert\,\boldsymbol{Y}=\boldsymbol{y} \sim {{\cal{N}}\big( \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y(\boldsymbol{y} - \boldsymbol{\mu}_y) , \boldsymbol{\Sigma}_x - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y\boldsymbol{\Sigma}^\top_{xy}\big)}

and apply this to

\displaystyle   (\boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i-1})\,\vert\,(\boldsymbol{Y}_i\,\vert\, \boldsymbol{Y}_{i-1})

to give

\displaystyle   \boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i} = \boldsymbol{y}_i  \sim  {{\cal{N}}\big( \hat{\boldsymbol{x}}^\flat_i + \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top  \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1}  (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) , \hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i\big)}

This is called the measurement update; more explicitly

\displaystyle   \begin{aligned}  \hat{\boldsymbol{x}}^i &\triangleq  \hat{\boldsymbol{x}}^\flat_i +  \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top  \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1}  (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq  {\hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i}  \end{aligned}

Sometimes the measurement residual \boldsymbol{v}_i, the measurement prediction covariance \boldsymbol{S}_i and the filter gain \boldsymbol{K}_i are defined and the measurement update is written as

\displaystyle   \begin{aligned}  \boldsymbol{v}_i & \triangleq  \boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i \\  \boldsymbol{S}_i & \triangleq  \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top\boldsymbol{S}^{-1}_i \\  \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i  \end{aligned}

We further have that

\displaystyle   \begin{aligned}  {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_{i-1}}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_i}  \end{aligned}

We thus obtain the Kalman filter prediction step:

\displaystyle   \begin{aligned}  \hat{\boldsymbol{x}}^\flat_i &=  \boldsymbol{A}_{i-1}\hat{\boldsymbol{x}}_{i-1} \\  \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1}                                       \hat{\boldsymbol{\Sigma}}_{i-1}                                       \boldsymbol{A}_{i-1}^\top                                     + \boldsymbol{\Sigma}^{(x)}_{i-1}  \end{aligned}

Further information can be found in (Boyd 2008), (Kleeman 1996) and (Särkkä 2013).

A Haskell Implementation

The hmatrix now uses type level literals via the DataKind extension in ghc to enforce compatibility of matrix and vector operations at the type level. See here for more details. Sadly a bug in the hmatrix implementation means we can’t currently use this excellent feature and we content ourselves with comments describing what the types would be were it possible to use it.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> module FunWithKalmanPart1a where
> import Numeric.LinearAlgebra.HMatrix hiding ( outer )
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import Control.Monad.Loops

Let us make our model almost deterministic but with noisy observations.

> stateVariance :: Double
> stateVariance = 1e-6
> obsVariance :: Double
> obsVariance = 1.0

And let us start with a prior normal distribution with a mean position and velocity of 0 with moderate variances and no correlation.

> -- muPrior :: R 2
> muPrior :: Vector Double
> muPrior = vector [0.0, 0.0]
> -- sigmaPrior :: Sq 2
> sigmaPrior :: Matrix Double
> sigmaPrior = (2 >< 2) [ 1e1,   0.0
>                       , 0.0,   1e1
>                       ]

We now set up the parameters for our model as outlined in the preceeding section.

> deltaT :: Double
> deltaT = 0.001
> -- bigA :: Sq 2
> bigA :: Matrix Double
> bigA = (2 >< 2) [ 1, deltaT
>                 , 0,      1
>                 ]
> a :: Double
> a = 1.0
> -- bigH :: L 1 2
> bigH :: Matrix Double
> bigH = (1 >< 2) [ a, 0
>                 ]
> -- bigSigmaY :: Sq 1
> bigSigmaY :: Matrix Double
> bigSigmaY = (1 >< 1) [ obsVariance ]
> -- bigSigmaX :: Sq 2
> bigSigmaX :: Matrix Double
> bigSigmaX = (2 >< 2) [ stateVariance, 0.0
>                      , 0.0,           stateVariance
>                      ]

The implementation of the Kalman filter using the hmatrix package is straightforward.

> -- outer ::  forall m n . (KnownNat m, KnownNat n) =>
> --           R n -> Sq n -> L m n -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer :: Vector Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> [Vector Double]
>          -> [(Vector Double, Matrix Double)]
> outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
> 
>     -- update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         -- v :: R m
>         v = y - bigH #> xHatFlat
>         -- bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>         -- bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>         -- xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         -- bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>         -- xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         -- bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

We create some ranodm data using our model parameters.

> singleSample ::(Double, Double) ->
>                RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> singleSample (xPrev, vPrev) = do
>   psiX <- rvarT (Normal 0.0 stateVariance)
>   let xNew = xPrev + deltaT * vPrev + psiX
>   psiV <- rvarT (Normal 0.0 stateVariance)
>   let vNew = vPrev + psiV
>   upsilon <- rvarT (Normal 0.0 obsVariance)
>   let y = a * xNew + upsilon
>   lift $ W.tell [(y, (xNew, vNew))]
>   return (xNew, vNew)
> streamSample :: RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> streamSample = iterateM_ singleSample (1.0, 1.0)
> samples :: ((Double, Double), [(Double, (Double, Double))])
> samples = W.runWriter (evalStateT (sample streamSample) (pureMT 2))

Here are the actual values of the randomly generated positions.

> actualXs :: [Double]
> actualXs = map (fst . snd) $ take nObs $ snd samples
> test :: [(Vector Double, Matrix Double)]
> test = outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX
>        (map (\x -> vector [x]) $ map fst $ snd samples)

And using the Kalman filter we can estimate the positions.

> estXs :: [Double]
> estXs = map (!!0) $ map toList $ map fst $ take nObs test
> nObs :: Int
> nObs = 1000

And we can see that the estimates track the actual positions quite nicely.

Of course we really wanted to estimate the velocity.

> actualVs :: [Double]
> actualVs = map (snd . snd) $ take nObs $ snd samples
> estVs :: [Double]
> estVs = map (!!1) $ map toList $ map fst $ take nObs test

Bibliography

Boyd, Stephen. 2008. “EE363 Linear Dynamical Systems.” http://stanford.edu/class/ee363.

Kleeman, Lindsay. 1996. “Understanding and Applying Kalman Filtering.” In Proceedings of the Second Workshop on Perceptive Systems, Curtin University of Technology, Perth Western Australia (25-26 January 1996).

Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. Vol. 3. Cambridge University Press.

Posted in Uncategorized | Leave a comment

Fun with (Kalman) Filters Part I

Suppose we wish to estimate the mean of a sample drawn from a normal distribution. In the Bayesian approach, we know the prior distribution for the mean (it could be a non-informative prior) and then we update this with our observations to create the posterior, the latter giving us improved information about the distribution of the mean. In symbols

\displaystyle   p(\theta \,\vert\, x) \propto p(x \,\vert\, \theta)p(\theta)

Typically, the samples are chosen to be independent, and all of the data is used to perform the update but, given independence, there is no particular reason to do that, updates can performed one at a time and the result is the same; nor is the order of update important. Being a bit imprecise, we have

\displaystyle   p(z \,\vert\, x, y) = p(z, x, y)p(x, y) = p(z, x, y)p(x)p(y) =  p((z \,\vert\, x) \,\vert\, y) =  p((z \,\vert\, y) \,\vert\, x)

The standard notation in Bayesian statistics is to denote the parameters of interest as \theta \in \mathbb{R}^p and the observations as x \in \mathbb{R}^n. For reasons that will become apparent in later blog posts, let us change notation and label the parameters as x and the observations as y.

Let us take a very simple example of a prior X \sim {\cal{N}}(0, \sigma^2) where \sigma^2 is known and then sample from a normal distribution with mean x and variance for the i-th sample c_i^2 where c_i is known (normally we would not know the variance but adding this generality would only clutter the exposition unnecessarily).

\displaystyle   p(y_i \,\vert\, x) = \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)

The likelihood is then

\displaystyle   p(\boldsymbol{y} \,\vert\, x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)

As we have already noted, instead of using this with the prior to calculate the posterior, we can update the prior with each observation separately. Suppose that we have obtained the posterior given i - 1 samples (we do not know this is normally distributed yet but we soon will):

\displaystyle   p(x \,\vert\, y_1,\ldots,y_{i-1}) = {\cal{N}}(\hat{x}_{i-1}, \hat{\sigma}^2_{i-1})

Then we have

\displaystyle   \begin{aligned}  p(x \,\vert\, y_1,\ldots,y_{i}) &\propto p(y_i \,\vert\, x)p(x \,\vert\, y_1,\ldots,y_{i-1}) \\  &\propto \exp-\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg) \exp-\bigg(\frac{(x - \hat{x}_{i-1})^2}{2\hat{\sigma}_{i-1}^2}\bigg) \\  &\propto \exp-\Bigg(\frac{x^2}{c_i^2} - \frac{2xy_i}{c_i^2} + \frac{x^2}{\hat{\sigma}_{i-1}^2} - \frac{2x\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg) \\  &\propto \exp-\Bigg( x^2\Bigg(\frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}\Bigg) - 2x\Bigg(\frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg)\Bigg)  \end{aligned}

Writing

\displaystyle   \frac{1}{\hat{\sigma}_{i}^2} \triangleq \frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}

and then completing the square we also obtain

\displaystyle   \frac{\hat{x}_{i}}{\hat{\sigma}_{i}^2} \triangleq \frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}

More Formally

Now let’s be a bit more formal about conditional probability and use the notation of \sigma-algebras to define {\cal{F}}_i = \sigma\{Y_1,\ldots, Y_i\} and M_i \triangleq \mathbb{E}(X \,\vert\, {\cal{F}}_i) where Y_i = X + \epsilon_i, X is as before and \epsilon_i \sim {\cal{N}}(0, c_k^2). We have previously calculated that M_i = \hat{x}_i and that {\cal{E}}((X - M_i)^2 \,\vert\, Y_1, \ldots Y_i) = \hat{\sigma}_{i}^2 and the tower law for conditional probabilities then allows us to conclude {\cal{E}}((X - M_i)^2) = \hat{\sigma}_{i}^2. By Jensen’s inequality, we have

\displaystyle   {\cal{E}}(M_i^2) = {\cal{E}}({\cal{E}}(X \,\vert\, {\cal{F}}_i)^2)) \leq  {\cal{E}}({\cal{E}}(X^2 \,\vert\, {\cal{F}}_i))) =  {\cal{E}}(X^2) = \sigma^2

Hence M is bounded in L^2 and therefore converges in L^2 and almost surely to M_\infty \triangleq {\cal{E}}(X \,\vert\, {\cal{F}}_\infty). The noteworthy point is that if M_\infty = X if and only if \hat{\sigma}_i converges to 0. Explicitly we have

\displaystyle   \frac{1}{\hat{\sigma}_i^2} = \frac{1}{\sigma^2} + \sum_{k=1}^i\frac{1}{c_k^2}

which explains why we took the observations to have varying and known variances. You can read more in Williams’ book (Williams 1991).

A Quick Check

We have reformulated our estimation problem as a very simple version of the celebrated Kalman filter. Of course, there are much more interesting applications of this but for now let us try “tracking” the sample from the random variable.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> module FunWithKalmanPart1 (
>     obs
>   , nObs
>   , estimates
>   , uppers
>   , lowers
>   ) where
> 
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> var, cSquared :: Double
> var       = 1.0
> cSquared  = 1.0
> 
> nObs :: Int
> nObs = 100
> createObs :: RVar (Double, [Double])
> createObs = do
>   x <- rvar (Normal 0.0 var)
>   ys <- replicateM nObs $ rvar (Normal x cSquared)
>   return (x, ys)
> 
> obs :: (Double, [Double])
> obs = evalState (sample createObs) (pureMT 2)
> 
> updateEstimate :: (Double, Double) -> (Double, Double) -> (Double, Double)
> updateEstimate (xHatPrev, varPrev) (y, cSquared) = (xHatNew, varNew)
>   where
>     varNew  = recip (recip varPrev + recip cSquared)
>     xHatNew = varNew * (y / cSquared + xHatPrev / varPrev)
> 
> estimates :: [(Double, Double)]
> estimates = scanl updateEstimate (y, cSquared) (zip ys (repeat cSquared))
>   where
>     y  = head $ snd obs
>     ys = tail $ snd obs
> 
> uppers :: [Double]
> uppers = map (\(x, y) -> x + 3 * (sqrt y)) estimates
> 
> lowers :: [Double]
> lowers = map (\(x, y) -> x - 3 * (sqrt y)) estimates

Bibliography

Williams, David. 1991. Probability with Martingales. Cambridge University Press.

Posted in Uncategorized | 2 Comments

Gibbs Sampling in Haskell

This is really intended as a draft chapter for our book. Given the diverse natures of the intended intended audiences, it is probably a bit light on explanation of the Haskell (use of monad transformers) for those with a background in numerical methods. It is hoped that the explanation of the mathematics is adequate for those with a background in Haskell but not necessarily in numerical methods. As always, any feedback is gratefully accepted.

Introduction

Imagine an insect, a grasshopper, trapped on the face of a clock which wants to visit each hour an equal number of times. However, there is a snag: it can only see the value of the hour it is on and the value of the hours immediately anti-clockwise and immediately clockwise. For example, if it is standing on 5 then it can see the 5, the 4, and the 6 but no others.

It can adopt the following strategy: toss a fair coin and move anti-clockwise for a head and move clockwise for a tail. Intuition tells us that over a large set of moves the grasshopper will visit each hour (approximately) the same number of times.

Can we confirm our intuition somehow? Suppose that the strategy has worked and the grasshopper is now to be found with equal probability on any hour. Then at the last jump, the grasshopper must either have been at the hour before the one it is now on or it must have been at the hour after the one it is now on. Let us denote the probability that the grasshopper is on hour n by \pi(n) and the (conditional) probability that the grasshopper jumps to state n given it was in state m by p(n \, |\, m). Then we have

\displaystyle   \pi'(n) = p(n \, |\, n - 1)\pi(n - 1) + p(n \, |\, n + 1)\pi(n + 1)

Substituting in where N is a normalising constant (12 in this case) we obtain

\displaystyle   \pi'(n) = \frac{1}{2}\frac{1}{N} + \frac{1}{2}\frac{1}{N} = \frac{1}{N}

This tells us that the required distribution is a fixed point of the grasshopper’s strategy. But does the strategy actually converge to the fixed point? Let us perform an experiment.

First we import some modules from hmatrix.

> {-# LANGUAGE FlexibleContexts #-}
> module Chapter1 where
> import Data.Packed.Matrix
> import Numeric.LinearAlgebra.Algorithms
> import Numeric.Container
> import Data.Random
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT

Let us use a clock with 5 hours to make the matrices sufficiently small to fit on one page.

Here is the strategy encoded as a matrix. For example the first row says jump to position 1 with probablity 0.5 or jump to position 5 with probability 0.5.

> eqProbsMat :: Matrix Double
> eqProbsMat = (5 >< 5)
>         [ 0.0, 0.5, 0.0, 0.0, 0.5
>         , 0.5, 0.0, 0.5, 0.0, 0.0
>         , 0.0, 0.5, 0.0, 0.5, 0.0
>         , 0.0, 0.0, 0.5, 0.0, 0.5
>         , 0.5, 0.0, 0.0, 0.5, 0.0
>         ]

We suppose the grasshopper starts at 1 o’clock.

> startOnOne :: Matrix Double
> startOnOne = ((1 >< 5) [1.0, 0.0, 0.0, 0.0, 0.0])

If we allow the grasshopper to hop 1000 times then we see that it is equally likely to be found on any hour hand with a 20% probability.

ghci> eqProbsMat
  (5><5)
   [ 0.0, 0.5, 0.0, 0.0, 0.5
   , 0.5, 0.0, 0.5, 0.0, 0.0
   , 0.0, 0.5, 0.0, 0.5, 0.0
   , 0.0, 0.0, 0.5, 0.0, 0.5
   , 0.5, 0.0, 0.0, 0.5, 0.0 ]

ghci> take 1 $ drop 1000  $ iterate (<> eqProbsMat) startOnOne
  [(1><5)
   [ 0.20000000000000007, 0.2, 0.20000000000000004, 0.20000000000000004, 0.2 ]]

In this particular case, the strategy does indeed converge.

Now suppose the grasshopper wants to visit each hour in proportion the value of the number on the hour. Lacking pen and paper (and indeed opposable thumbs), it decides to adopt the following strategy: toss a fair coin as in the previous strategy but only move if the number is larger than the one it is standing on; if, on the other hand, the number is smaller then choose a number at random from between 0 and 1 and move if this value is smaller than the ratio of the proposed hour and the hour on which it is standing otherwise stay put. For example, if the grasshopper is standing on 5 and gets a tail then it will move to 6 but if it gets a head then four fifths of the time it will move to 4 but one fifth of the time it will stay where it is.

Suppose that the strategy has worked (it is not clear that is has) and the grasshopper is now to be found at 12 o’clock 12 times as often as at 1 o’clock, at 11 o’clock 11 times as often as at 1 o’clock, etc. Then at the last jump, the grasshopper must either have been at the hour before the one it is now on, the hour after the one it is now on or the same hour it is now on. Let us denote the probability that the grasshopper is on hour n by \pi(n).

\displaystyle   \pi'(n) = p(n \, |\, n - 1)\pi(n - 1) + p(n \, |\, n)\pi(n) + p(n \, |\, n + 1)\pi(n + 1)

Substituting in at 4 say

\displaystyle   \begin{aligned}  \pi'(4) &= \frac{1}{2}\pi(3) +             \frac{1}{2}\frac{1}{4}\pi(4) +             \frac{1}{2}\frac{4}{5}\pi(5) \\          &= \frac{1}{2}\bigg(\frac{3}{N} + \frac{1}{4}\frac{4}{N} + \frac{4}{5}\frac{5}{N}\bigg) \\          &= \frac{1}{N}\frac{8}{2} \\          &= \frac{4}{N} \\          &= \pi(4)  \end{aligned}

The reader can check that this relationship holds for all other hours. This tells us that the required distribution is a fixed point of the grasshopper’s strategy. But does this strategy actually converge to the fixed point?

Again, let us use a clock with 5 hours to make the matrices sufficiently small to fit on one page.

Here is the strategy encoded as a matrix. For example the first row says jump to position 1 with probablity 0.5 or jump to position 5 with probability 0.5.

> incProbsMat :: Matrix Double
> incProbsMat = scale 0.5 $
>   (5 >< 5)
>     [ 0.0,         1.0,     0.0,        0.0, 1.0
>     , 1.0/2.0, 1.0/2.0,     1.0,        0.0, 0.0
>     , 0.0,     2.0/3.0, 1.0/3.0,        1.0, 0.0
>     , 0.0,         0.0, 3.0/4.0,    1.0/4.0, 1.0
>     , 1.0/5.0,     0.0,     0.0,    4.0/5.0, 1.0/5.0 + 4.0/5.0
>     ]

We suppose the grasshopper starts at 1 o’clock.

If we allow the grasshopper to hop 1000 times then we see that it is equally likely to be found on any hour hand n with a probability of n times the probability of being found on 1.

ghci> incProbsMat
  (5><5)
   [  0.0,                0.5,                 0.0,   0.0, 0.5
   , 0.25,               0.25,                 0.5,   0.0, 0.0
   ,  0.0, 0.3333333333333333, 0.16666666666666666,   0.5, 0.0
   ,  0.0,                0.0,               0.375, 0.125, 0.5
   ,  0.1,                0.0,                 0.0,   0.4, 0.5 ]

ghci> take 1 $ drop 1000  $ iterate (<> incProbsMat) startOnOne
  [(1><5)
   [ 6.666666666666665e-2, 0.1333333333333333, 0.19999999999999996, 0.2666666666666666, 0.33333333333333326 ]]

In this particular case, the strategy does indeed converge.

Surprisingly, this strategy produces the desired result and is known as the Metropolis Algorithm. What the grasshopper has done is to construct a (discrete) Markov Process which has a limiting distribution (the stationary distribution) with the desired feature: sampling from this process will result in each hour being sampled in proportion to its value.

Markov Chain Theory

Let us examine what is happening in a bit more detail.

The grasshopper has started with a very simple Markov Chain: one which jumps clockwise or anti-clockwise with equal probability and then modified it. But what is a Markov Chain?

A time homogeneous Markov chain is a countable sequence of random variables
X_0, X_1, \ldots such that

\displaystyle   \mathbb{P} (X_{n+1} = j \,|\, X_0 = i_0, X_1 = i_1, \dots X_n = i) = \mathbb{P} (X_{n+1} = j \,|\, X_n = i)

We sometimes say that a Markov Chain is discrete time stochastic process with the above property.

So the very simple Markov Chain can be described by

\displaystyle   q(i, j) =  \begin{cases}  \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = \frac{1}{2} & \text{if } j = i + 1 \mod N \\  \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = \frac{1}{2} & \text{if } j = i - 1 \mod N \\  \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = 0 & \text{otherwise }  \end{cases}

The grasshopper knows that \pi(i) = i/N so it can calculate \pi(j)/\pi(i) = j/i without knowing N. This is important because now, without knowing N, the grasshopper can evaluate

\displaystyle   p(i, j) =  \begin{cases}  q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\  1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(k) q(k,i)}{\pi(i) q(i,k)} \land 1 \bigg] & \text{if } j = i  \end{cases}

where \land takes the maximum of its arguments. Simplifying the above by substituing in the grasshopper’s probabilities and noting that j = i \pm 1 \mod N is somewhat obscure way of saying jump clockwise or anti-clockwise we obtain

\displaystyle   q(i, j) =  \begin{cases}  \frac{1}{2} (\frac{j}{i} \land 1) & \text{if } j \text{ is 1 step clockwise} \\  \frac{1}{2} (\frac{j}{i} \land 1) & \text{if } j \text{ is 1 step anti-clockwise} \\  1 - \frac{1}{2}(\frac{j^c}{i} \land 1) - \frac{1}{2}(\frac{j^a}{i} \land 1) & \text{if } j = i \text{ and } j^c \text{ is one step clockwise and } j^a \text{ is one step anti-clockwise} \\  0 & \text{otherwise}  \end{cases}

The Ergodic Theorem

In most studies of Markov chains, one is interested in whether a chain has a stationary distribution. What we wish to do is take a distribution and create a chain with this distribution as its stationary distribution. We will still need to show that our chain does indeed have the correct stationary distribution and we state the relevant theorem somewhat informally and with no proof.

Theorem

An irreducible, aperiodic and positive recurrent Markov chain has a unique stationary distribution.

Roughly speaking

  • Irreducible means it is possible to get from any state to any other state.

  • Aperiodic means that returning to a state having started at that state occurs at irregular times.

  • Positive recurrent means that the first time to hit a state is finite (for every state and more pedantically except on sets of null measure).

Note that the last condition is required when the state space is infinite – see Skrikant‘s lecture notes for an example and also for a more formal definition of the theorem and its proof.

Algorithm

Let \pi be a probability distribution on the state space \Omega with \pi(i) > 0 for all i and let (Q, \pi_0) be an ergodic Markov chain on \Omega with transition probabilities q(i,j) > 0 (the latter condition is slightly stronger than it need be but we will not need fully general conditions).

Create a new (ergodic) Markov chain with transition probabilities

\displaystyle   p_{ij} =  \begin{cases}  q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\  1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j = i  \end{cases}

where \land takes the maximum of its arguments.

Calculate the value of interest on the state space e.g. the total magnetization for each step produced by this new chain.

Repeat a sufficiently large number of times and take the average. This gives the estimate of the value of interest.

Convergence

Let us first note that the Markov chain produced by this algorithm almost trivially satisfies the detailed balance condition, for example,

\displaystyle   \begin{aligned}  \pi(i) q(i,j)\bigg[\frac{\pi(j) q(j, i)}{\pi(i)q(i,j)} \land 1\bigg]  &= \pi(i)q(i,j) \land \pi(j)q(j,i) \\  &= \pi(j)q(j,i)\bigg[\frac{\pi(i) q(i, j)}{\pi(j)q(j,i)} \land 1\bigg]  \end{aligned}

Secondly since we have specified that (Q, \pi_0) is ergodic then clearly (P, \pi_0) is also ergodic (all the transition probabilities are > 0).

So we know the algorithm will converge to the unique distribution we specified to provide estimates of values of interest.

Gibbs Sampling

Random Scan

For simplicity let us consider a model with two parameters and that we sample from either parameter with equal probability. In this sampler, We update the parameters in a single step.

\displaystyle   \begin{cases}  \text{Sample } \theta_1^{(i+1)} \sim \pi(\theta_1 \,\big|\, \theta_2^{(i)}) & \text{with probability } \frac{1}{2} \\  \text{Sample } \theta_2^{(i+1)} \sim \pi(\theta_2 \,\big|\, \theta_1^{(i)}) & \text{with probability } \frac{1}{2}  \end{cases}

The transition density kernel is then given by

\displaystyle   q\big(\boldsymbol{\theta}^{(i+1)}, \boldsymbol{\theta}^{(i)}\big) =  \frac{1}{2}\pi(\theta_1^{(i+1)} \,\big|\, \theta_2^{(i)})\delta({\theta_2^{(i)},\theta_2^{(i+1)}}) +  \frac{1}{2}\pi(\theta_2^{(i+1)} \,\big|\, \theta_1^{(i)})\delta({\theta_1^{(i)},\theta_1^{(i+1)}})

where \delta is the Dirac delta function.

Detailed balance

This sampling scheme satisifies the detailed balance condition. We have

\displaystyle   \begin{aligned}  \pi(\theta_1, \theta_2)  \bigg[  \frac{1}{2}\pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \frac{1}{2}\pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\  \frac{1}{2}\bigg[\pi(\theta_1, \theta_2)  \pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \pi(\theta_1, \theta_2)  \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\  \frac{1}{2}\bigg[\pi(\theta_1, \theta_2')  \pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \pi(\theta_1', \theta_2)  \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\  \frac{1}{2}\bigg[  \pi(\theta_2')\pi(\theta_1 \,\big|\, \theta_2')  \frac{1}{2}\pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \pi(\theta_1')\pi(\theta_2 \,\big|\, \theta_1')  \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})  \bigg] &= \\  \frac{1}{2}\bigg[  \pi(\theta_1', \theta_2')\pi(\theta_1 \,\big|\, \theta_2')  \delta({\theta_2',\theta_2}) +  \pi(\theta_1', \theta_2')\pi(\theta_2 \,\big|\, \theta_1')  \delta({\theta_1',\theta_1})  \bigg] &= \\  \pi(\theta_1', \theta_2')\bigg[  \frac{1}{2}\pi(\theta_1 \,\big|\, \theta_2')  \delta({\theta_2',\theta_2}) +  \frac{1}{2}\pi(\theta_2 \,\big|\, \theta_1')  \delta({\theta_1',\theta_1})  \bigg] &  \end{aligned}

In other words

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big) =  \pi\big({\boldsymbol{\theta'}}\big)q\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big)

Hand waving slightly, we can see that this scheme satisfies the premises of the ergodic theorem and so we can conclude that there is a unique stationary distribution and \pi must be that distribution.

Systematic Scan

Most references on Gibbs sampling do not describe the random scan but instead something called a systematic scan.

Again for simplicity let us consider a model with two parameters. In this sampler, we update the parameters in two steps.

\displaystyle   \begin{aligned}  \text{Sample } \theta_1^{(i+1)} & \sim & \pi(\theta_1 \,\big|\, \theta_2^{(i)}) \\  \text{Sample } \theta_2^{(i+1)} & \sim & \pi(\theta_2 \,\big|\, \theta_1^{(i+1)})  \end{aligned}

We observe that this is not time-homegeneous; at each step the transition matrix flips between the two transition matrices given by the individual steps. Thus although, as we show below, each individual transtion satisifies the detailed balance condition, we cannot apply the ergodic theorem as it only applies to time-homogeneous processes.

The transition density kernel is then given by

\displaystyle   q\big(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(i+1)}\big) =  q_1\big(\boldsymbol{\theta}^{(i)}, \tilde{\boldsymbol{\theta}}\big)  q_2\big(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^{(i+1)}\big)

where \tilde{\boldsymbol{\theta}} = (\theta_1^{(i+1)}, \theta_2^{(i)})^\top.

Thus

\displaystyle   q\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi(\theta_1' \,\big|\, \theta_2)  \pi(\theta_2' \,\big|\, \theta_1')

Detailed balance

Suppose that we have two states \boldsymbol{\theta} = (\theta_1, \theta_2)^\top and \boldsymbol{\theta}' = (\theta_1', \theta_2')^\top and that \theta_2 \neq \theta_2'. Then q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = 0. Trivially we have

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi\big({\boldsymbol{\theta'}}\big)q_1\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)

Now suppose that \theta_2 = \theta_2'

\displaystyle   \begin{aligned}  \pi(\theta_1, \theta_2)q_1((\theta_1, \theta_2), (\theta_1', \theta_2)) & =  \pi(\theta_1, \theta_2)\pi(\theta_1' \,\big|\, \theta_2) \\  & = \pi(\theta_1 \,\big|\, \theta_2)\pi(\theta_1', \theta_2) \\  & = \pi(\theta_1 \,\big|\, \theta_2')\pi(\theta_1', \theta_2') \\  & = \pi(\theta_1', \theta_2')q_1((\theta_1', \theta_2), (\theta_1, \theta_2))  \end{aligned}

So again we have

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi\big({\boldsymbol{\theta'}}\big)q_1\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)

Similarly we can show

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q_2\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi\big({\boldsymbol{\theta'}}\big)q_2\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)

But note that

\displaystyle   \begin{aligned}  \pi(\theta_1, \theta_2)  q_1((\theta_1, \theta_2), (\theta_1', \theta_2))  q_2((\theta_1', \theta_2), (\theta_1', \theta_2')) & =  \pi(\theta_1, \theta_2)  \pi(\theta_1' \,\big|\, \theta_2)  \pi(\theta_2' \,\big|\, \theta_1') \\  & = \pi(\theta_1', \theta_2) \pi(\theta_1 \,\big|\, \theta_2) \pi(\theta_2' \,\big|\, \theta_1') \\  & = \pi(\theta_1' \,\big|\, \theta_2) \pi(\theta_1 \,\big|\, \theta_2) \pi(\theta_2', \theta_1')  \end{aligned}

whereas

\displaystyle   \begin{aligned}  \pi(\theta_1', \theta_2')  q_1((\theta_1', \theta_2'), (\theta_1, \theta_2'))  q_2((\theta_1, \theta_2'), (\theta_1, \theta_2)) & =  \pi(\theta_1', \theta_2')  \pi(\theta_1 \,\big|\, \theta_2')  \pi(\theta_2 \,\big|\, \theta_1) \\  & = \pi(\theta_1, \theta_2') \pi(\theta_1' \,\big|\, \theta_2') \pi(\theta_2 \,\big|\, \theta_1) \\  & = \pi(\theta_2' \,\big|\, \theta_1) \pi(\theta_1' \,\big|\, \theta_2') \pi(\theta_2, \theta_1)  \end{aligned}

and these are not necessarily equal.

So the detailed balance equation is not satisfied, another sign that we cannot appeal to the ergodic theorem.

An Example: The Bivariate Normal

Let us demonstrate the Gibbs sampler with a distribution which we actually know: the bivariate normal.

\displaystyle   \begin{bmatrix}  \theta_1 \\  \theta_2  \end{bmatrix}  \bigg| y \sim  N \begin{bmatrix}  \begin{bmatrix}  \theta_1 \\  \theta_2  \end{bmatrix} &  \begin{bmatrix}  1 & \rho \\  \rho & 1  \end{bmatrix}  \end{bmatrix}

The conditional distributions are easily calculated to be

\displaystyle   \begin{aligned}  \theta_1 \,\vert\, \theta_2, y &\sim {\cal{N}}(y_1 + \rho(\theta_2 - y_2), 1 - \rho^2) \\  \theta_2 \,\vert\, \theta_1, y &\sim {\cal{N}}(y_2 + \rho(\theta_1 - y_1), 1 - \rho^2)  \end{aligned}

Let’s take a correlation of 0.8, a data point of (0.0, 0.0) and start the chain at (2.5, 2.5).

> rho :: Double
> rho = 0.8
> 
> y :: (Double, Double)
> y = (0.0, 0.0)
> 
> y1, y2 :: Double
> y1 = fst y
> y2 = snd y
> 
> initTheta :: (Double, Double)
> initTheta = (2.5, 2.5)

We pre-calculate the variance needed for the sampler.

> var :: Double
> var = 1.0 - rho^2

In Haskell and in the random-fu package, sampling from probability distributions is implemented as a monad. We sample from the relevant normal distributions and keep the trajectory using a writer monad.

> gibbsSampler :: Double -> RVarT (W.Writer [(Double,Double)]) Double
> gibbsSampler oldTheta2 = do
>   newTheta1 <- rvarT (Normal (y1 + rho * (oldTheta2 - y2)) var)
>   lift $ W.tell [(newTheta1, oldTheta2)]
>   newTheta2 <- rvarT (Normal (y2 + rho * (newTheta1 - y1)) var)
>   lift $ W.tell [(newTheta1, newTheta2)]
>   return $ newTheta2

It is common to allow the chain to “burn in” so as to “forget” its starting position. We arbitrarily burn in for 10,000 steps.

> burnIn :: Int
> burnIn = 10000

We sample repeatedly from the sampler using the monadic form of iterate. Running the monadic stack is slightly noisy but nonetheless straightforward. We use mersenne-random-pure64 (albeit indirectly via random-source) as our source of entropy.

> runMCMC :: Int -> [(Double, Double)]
> runMCMC n =
>   take n $
>   drop burnIn $
>   snd $
>   W.runWriter (evalStateT (sample (ML.iterateM_ gibbsSampler (snd initTheta))) (pureMT 2))

We can look at the trajectory of our sampler for various run lengths.

For bigger sample sizes, plotting the distribution sampled re-assures us that we are indeed sampling from a bivariate normal distribution as the theory predicted.

Applications to Bayesian Statistics

Some of what is here and here excluding JAGS and STAN (after all this is a book about Haskell).

Applications to Physics

Applications to Physics

Most of what is here.

Posted in Bayesian, Haskell, Statistics | Leave a comment

A Monoidal Category Example

I have never felt entirely comfortable with Haskell’s arrows and skimming the literature for their categorical basis didn’t reveal anything as straightforward as monads or applicatives. It did however lead me to start thinking about monoidal categories and since I always want an example, I thought I would write up Hilbert spaces.

Let H_1 and H_2 be Hilbert spaces then as vector spaces we can form the tensor product H_1 \otimes H_2. The tensor product can be defined as the free vector space on H_1 and H_2 as sets (that is purely formal sums of (u,v)) modulo a relation \sim defined by

\displaystyle   \begin{aligned}  (u_1,v) + (u_2,v) \sim (u_1 + u_2,v) \\  (u,v_1) + (u,v_2) \sim (u,v_1 + v_2) \\  c(u,v) \sim (cu,v) \\  c(u,v) \sim (u,cv)  \end{aligned}

Slightly overloading notation, we can define an inner product on the tensored space by

\displaystyle   \langle u_1 \otimes v_1, u_2 \otimes v_2\rangle =  \langle u_1, v_1 \rangle \langle u_2, v_2\rangle

Of course this might not be complete so we define the tensor product on Hilbert spaces to be the completion of this inner product.

For Hilbert spaces to form a monoidal category, we take the arrows (in the categorical sense) to be linear continuous maps and the bifunctor to be the tensor product. We also need an identity object I which we take to be \mathbb{R} considered as a Hilbert space. We should check the coherence conditions but the associativity of the tensor product and the fact that our Hilbert spaces are over the \mathbb{R} make this straightforward.

Now for some slightly interesting properties of this category.

  • The tensor product is not the product in the categorical sense. If \{u_i\} and \{v_i\} are (orthonormal) bases for H_1 and H_2 then \{u_i \otimes v_j\} is a (orthonormal) basis for H_1 \otimes H_2. Thus a linear combination of basis vectors in the tensor product cannot be expressed as the tensor of basis vectors in the component spaces.

  • There is no diagonal arrow \Delta : X \rightarrow X \otimes X. Suppose there were such a diagonal then for arbitrary \lambda we would have \Delta(\lambda u) = (\lambda u) \otimes (\lambda u) = \lambda^2 (u \otimes u) and since \Delta must be linear this is not possible.

Presumably the latter is equivalent to the statement in quantum mechanics of “no cloning”.

Posted in category theory | Leave a comment

Hölder’s and Minkowski’s Inequalities

Introduction

I have seen Hölder’s inequality and Minkowski’s inequality proved in several ways but this seems the most perspicuous (to me at any rate).

Young’s Inequality

If a, b \ge 0 and p,q \ge 1 such that

\displaystyle   \frac{1}{p} + \frac{1}{q} = 1

then

\displaystyle   ab \le \frac{a^p}{p} + \frac{b^q}{q}

A p and q satisfying the premise are known as conjugate indices.

Proof

Since \log is convex we have

\displaystyle   t\log{x} + (1 - t)\log{y} \le \log{(tx + (1 - t)y)}

Substituting in appropriate values gives

\displaystyle   \frac{1}{p}\log{a^p} + \frac{1}{q}\log{b^q} \le \log{\bigg(\frac{a^p}{p} + \frac{b^q}{q}\bigg)}

or

\displaystyle   \log{a} + \log{b} \le \log{\bigg(\frac{a^p}{p} + \frac{b^q}{q}\bigg)}

Now take exponents.

\blacksquare

Hölders’s Inequality

Let p and q be conjugate indices with 1 < p < \infty and let f \in L^p(\Omega) and g \in L^q(\Omega) then fg \in L^1(\Omega) and

\displaystyle   \|fg\|_{L^1} \le \|f\|_{L^p}\|g\|_{L^q}

Proof

By Young’s inequality

\displaystyle   \int_\Omega \frac{|f(x)|}{\|f\|_{L^p}} \frac{|g(x)|}{\|g\|_{L^q}} \le  \int_\Omega \frac{1}{p}\frac{|f(x)|^p}{\|f\|_{L^p}^p} +              \frac{1}{q}\frac{|g(x)|^q}{\|g\|_{L^q}^q} =  \frac{1}{p} + \frac{1}{q} = 1

\blacksquare

By applying a counting measure to \Omega we also obtain

\displaystyle   \sum |x_i y_i| \le \big(\sum |x_i|^p\big)^{1/p} \big(\sum |y_i|^q\big)^{1/q}

Minkowski’s Inequality

\displaystyle   \|f + g\|_{L^p} \le \|f\|_{L^p} + \|g\|_{L^p}

Proof

By Hölder’s inequality

\displaystyle   \int_\Omega |f + g|^p \le \int_\Omega |f||f + g|^{p-1} +                            \int_\Omega |g||f + g|^{p-1}                            \le \|f\|_{L^p}A + \|g\|_{L^p}A

where

\displaystyle   A = \||f + g|^{p-1}\|_{L^q} = \big(\int_\Omega |f(x) + g(x)|^p\big)^{1/q}

and A is finite since L^p is a vector space.

\blacksquare

Posted in Probability | Leave a comment