It’s possible to do Gibbs sampling in most languages, and since I am doing some work in R and some work in Haskell, I thought I’d present a simple example in both languages: estimating the mean of a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or STAN for the actual sampling and to do the pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and STAN as well as Haskell.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}
> module Gibbs (
> main
> , m
> , Moments(..)
> ) where
>
> import qualified Data.Vector.Unboxed as V
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import qualified Control.Foldl as L
>
> import Diagrams.Backend.Cairo.CmdLine
>
> import LinRegAux
>
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )
The length of our chain and the burn-in.
> nrep, nb :: Int
> nb = 5000
> nrep = 105000
Data generated from a normal distribution, the mean of which we wish to estimate.
> xs :: [Double]
> xs = [
> 11.0765808082301
> , 10.918739177542
> , 15.4302462747137
> , 10.1435649220266
> , 15.2112705014697
> , 10.441327659703
> , 2.95784054883142
> , 10.2761068139607
> , 9.64347295100318
> , 11.8043359297675
> , 10.9419989262713
> , 7.21905367667346
> , 10.4339807638017
> , 6.79485294803006
> , 11.817248658832
> , 6.6126710570584
> , 12.6640920214508
> , 8.36604701073303
> , 12.6048485320333
> , 8.43143879537592
> ]
For a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions.
Referring back to the explanation of the Metropolis algorithm, let us describe the state by its parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$ and the conditional posteriors by $\pi(\theta_i \mid \boldsymbol{\theta}_{-i})$, where $\boldsymbol{\theta}_{-i}$ denotes all the parameters except $\theta_i$. If we take the proposal for the $i$-th component to be its conditional posterior then the acceptance ratio is

$$\frac{\pi(\theta_i' \mid \boldsymbol{\theta}_{-i})\,\pi(\boldsymbol{\theta}_{-i})\,\pi(\theta_i \mid \boldsymbol{\theta}_{-i})}{\pi(\theta_i \mid \boldsymbol{\theta}_{-i})\,\pi(\boldsymbol{\theta}_{-i})\,\pi(\theta_i' \mid \boldsymbol{\theta}_{-i})} = 1$$

where we have used the rules of conditional probability and the fact that the components of the state other than $\theta_i$ do not change, $\boldsymbol{\theta}_{-i}' = \boldsymbol{\theta}_{-i}$.
Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters.
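Abstractly, one sweep of a two-parameter Gibbs sampler just draws each parameter from its conditional posterior given the current value of the other. A minimal sketch (the names sweep, condMu and condTau are illustrative, not from the code in this post):

```haskell
-- One Gibbs sweep for a two-parameter state: sample each parameter
-- from its conditional posterior given the current value of the other.
sweep :: Monad m
      => (Double -> m Double)  -- sample mu  given tau
      -> (Double -> m Double)  -- sample tau given mu
      -> (Double, Double)      -- current (mu, tau)
      -> m (Double, Double)
sweep condMu condTau (_, tau) = do
  mu'  <- condMu tau
  tau' <- condTau mu'
  return (mu', tau')
```

The gibbsSampler defined later in the post has exactly this shape, with the conditionals being a normal and a gamma distribution.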
It is fairly standard to use an improper prior

$$\pi(\mu, \tau) \propto \frac{1}{\tau}, \quad -\infty < \mu < \infty,\; 0 < \tau < \infty$$

The likelihood is

$$p(\boldsymbol{x} \mid \mu, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

re-writing in terms of precision $\tau \triangleq 1/\sigma^2$

$$p(\boldsymbol{x} \mid \mu, \tau) \propto \tau^{n/2} \exp\left(-\frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2\right)$$

Thus the posterior is

$$\pi(\mu, \tau \mid \boldsymbol{x}) \propto \tau^{n/2 - 1} \exp\left(-\frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2\right)$$

We can re-write the sum in terms of the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ using

$$\sum_{i=1}^n (x_i - \mu)^2 = (n - 1)s^2 + n(\mu - \bar{x})^2$$

Thus the conditional posterior for $\mu$ is

$$\pi(\mu \mid \tau, \boldsymbol{x}) \propto \exp\left(-\frac{n\tau}{2}(\mu - \bar{x})^2\right)$$

which we recognise as a normal distribution with mean of $\bar{x}$ and a variance of $(n\tau)^{-1}$.

The conditional posterior for $\tau$ is

$$\pi(\tau \mid \mu, \boldsymbol{x}) \propto \tau^{n/2 - 1} \exp\left(-\frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2\right)$$

which we recognise as a gamma distribution with a shape of $\frac{n}{2}$ and a rate of $\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2$ (equivalently, a scale which is the reciprocal of the rate).

In this particular case, we can calculate the marginal posterior of $\mu$ analytically. Integrating out $\tau$ we have

$$\pi(\mu \mid \boldsymbol{x}) \propto \int_0^\infty \tau^{n/2 - 1} \exp\left(-\frac{\tau}{2}\left[(n-1)s^2 + n(\mu - \bar{x})^2\right]\right)\,\mathrm{d}\tau$$

Finally we can calculate

$$\pi(\mu \mid \boldsymbol{x}) \propto \left[1 + \frac{n(\mu - \bar{x})^2}{(n-1)s^2}\right]^{-n/2}$$

This is the non-standardized Student’s t-distribution $t_{n-1}(\bar{x}, s^2/n)$.

Alternatively the marginal posterior of $\mu$ is

$$\frac{\mu - \bar{x}}{s/\sqrt{n}} \,\bigg|\, \boldsymbol{x} \sim t_{n-1}$$

where $t_{n-1}$ is the standard t distribution with $n - 1$ degrees of freedom.
Following up on a comment from a previous blog post, let us try using the foldl package to calculate the length, the sum and the sum of squares, traversing the list only once. This is an improvement on creating your own strict record and using foldl’, but maybe it is not suitable for some methods, e.g. calculating the skewness and kurtosis incrementally; see below.
> x2Sum, xSum, n :: Double
> (x2Sum, xSum, n) = L.fold stats xs
> where
> stats = (,,) <$>
> (L.premap (\x -> x * x) L.sum) <*>
> L.sum <*>
> L.genericLength
And re-writing the sample variance using

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2\right)$$

we can then calculate the sample mean and variance using the sums we have just calculated.
> xBar, varX :: Double
> xBar = xSum / n
> varX = n * (m2Xs - xBar * xBar) / (n - 1)
> where m2Xs = x2Sum / n
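As a quick sanity check (not part of the original code; the data here is made up), the one-pass sums give the same sample variance as the naive two-pass formula:

```haskell
import Data.List (foldl')

-- Two-pass sample variance, straight from the definition.
varTwoPass :: [Double] -> Double
varTwoPass ys = sum (map (\y -> (y - mu) ^ 2) ys) / (n - 1)
  where
    n  = fromIntegral (length ys)
    mu = sum ys / n

-- One-pass version using only the sum and the sum of squares,
-- exactly as in xBar and varX above.
varOnePass :: [Double] -> Double
varOnePass ys = n * (x2s / n - xbar * xbar) / (n - 1)
  where
    n    = fromIntegral (length ys)
    s1   = foldl' (+) 0 ys
    x2s  = foldl' (+) 0 (map (^ 2) ys)
    xbar = s1 / n

main :: IO ()
main = print (abs (varTwoPass ys - varOnePass ys) < 1e-9)
  where ys = [11.1, 10.9, 15.4, 10.1, 2.9]
```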
In random-fu, the Gamma distribution is specified by the rate parameter, $\beta$.
> beta, initTau :: Double
> beta = 0.5 * n * varX
> initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)
Our sampler takes an old value of $\tau$ and creates new values of $\mu$ and $\tau$.
> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double))
> gibbsSampler oldTau = do
> newMu <- sample (Normal xBar (recip (sqrt (n * oldTau))))
> let shape = 0.5 * n
> scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar)
> newTau <- sample (Gamma shape (recip scale))
> return $ Just ((newMu, newTau), newTau)
From which we can create an infinite stream of samples.
> gibbsSamples :: [(Double, Double)]
> gibbsSamples = evalState (ML.unfoldrM gibbsSampler initTau) (pureMT 1)
As our chains might be very long, we calculate the mean, variance, skewness and kurtosis using an incremental method.
> data Moments = Moments { mN :: !Double
> , m1 :: !Double
> , m2 :: !Double
> , m3 :: !Double
> , m4 :: !Double
> }
> deriving Show
> moments :: [Double] -> Moments
> moments xs = foldl' f (Moments 0.0 0.0 0.0 0.0 0.0) xs
> where
> f :: Moments -> Double -> Moments
> f m x = Moments n' m1' m2' m3' m4'
> where
> n = mN m
> n' = n + 1
> delta = x - (m1 m)
> delta_n = delta / n'
> delta_n2 = delta_n * delta_n
> term1 = delta * delta_n * n
> m1' = m1 m + delta_n
> m4' = m4 m +
> term1 * delta_n2 * (n'*n' - 3*n' + 3) +
> 6 * delta_n2 * m2 m - 4 * delta_n * m3 m
> m3' = m3 m + term1 * delta_n * (n' - 2) - 3 * delta_n * m2 m
> m2' = m2 m + term1
In order to examine the posterior, we create a histogram.
> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
> where
> lower = xBar - 2.0 * sqrt varX
> upper = xBar + 2.0 * sqrt varX
And fill it with the specified number of samples preceded by a burn-in.
> hist :: Histogram V.Vector BinD Double
> hist = fillBuilder hb (take (nrep - nb) $ drop nb $ map fst gibbsSamples)
Now we can plot this.
And calculate the skewness and kurtosis.
> m :: Moments
> m = moments (take (nrep - nb) $ drop nb $ map fst gibbsSamples)
ghci> import Gibbs
ghci> putStrLn $ show $ (sqrt (mN m)) * (m3 m) / (m2 m)**1.5
8.733959917065126e-4
ghci> putStrLn $ show $ (mN m) * (m4 m) / (m2 m)**2
3.451374739494607
We expect a skewness of 0 and a kurtosis of $3 + 6/(\nu - 4) = 3.4$ for $\nu = 19$ degrees of freedom. Not too bad.
JAGS is a mature, declarative, domain specific language for building Bayesian statistical models using Gibbs sampling.
Here is our model as expressed in JAGS. Somewhat terse.
model {
for (i in 1:N) {
x[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0, 1.0E-6)
tau <- pow(sigma, -2)
sigma ~ dunif(0, 1000)
}
To run it and examine its results, we wrap it up in some R.
## Import the library that allows R to inter-work with jags.
library(rjags)
## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)
jags <- jags.model('example1.bug',
data = list('x' = fn[,1], 'N' = 20),
n.chains = 4,
n.adapt = 100)
## Burnin for 10000 samples
update(jags, 10000);
mcmc_samples <- coda.samples(jags, variable.names=c("mu", "sigma"), n.iter=20000)
png(file="diagrams/jags.png",width=400,height=350)
plot(mcmc_samples)
dev.off()
And now we can look at the posterior for $\mu$.
STAN is a domain specific language for building Bayesian statistical models similar to JAGS but newer and which allows variables to be re-assigned and so cannot really be described as declarative.
Here is our model as expressed in STAN. Again, somewhat terse.
data {
int<lower=0> N;
real x[N];
}
parameters {
real mu;
real<lower=0,upper=1000> sigma;
}
model{
x ~ normal(mu, sigma);
mu ~ normal(0, 1000);
}
Just as with JAGS, to run it and examine its results, we wrap it up in some R.
library(rstan)
## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)
## Running the model
fit1 <- stan(file = 'Stan.stan',
data = list('x' = fn[,1], 'N' = 20),
pars=c("mu", "sigma"),
chains=3,
iter=30000,
warmup=10000)
png(file="diagrams/stan.png",width=400,height=350)
plot(fit1)
dev.off()
Again we can look at the posterior although we only seem to get medians and 80% intervals.
Write the histogram produced by the Haskell code to a file.
> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
> mainRender ( DiagramOpts (Just 900) (Just 700) fn
> , DiagramLoopOpts False Nothing 0
> )
> main :: IO ()
> main = do
> displayHeader "diagrams/DataScienceHaskPost.png"
> (barDiag
> (zip (map fst $ asList hist) (map snd $ asList hist)))
The code can be downloaded from github.
The other speaker at the Machine Learning Meetup at which I gave my talk on automatic differentiation gave a very interesting talk on A/B testing. Apparently this is big business these days as attested by the fact I got 3 ads above the wikipedia entry when I googled for it.
It seems that people tend to test with small sample sizes and to do so very often, resulting in spurious results. Of course readers of XKCD will be well aware of some of the pitfalls.
I thought a Bayesian approach might circumvent some of the problems and set out to write a blog article, only to discover that there was no Haskell library for sampling from Student’s t. Actually there was one, but it is currently an unreleased part of random-fu. So I set about fixing this shortfall.
I thought I had better run a few tests so I calculated the sampled mean, variance, skewness and kurtosis.
I wasn’t really giving this my full attention and as a result ran into a few problems with space. I thought these were worth sharing and that is what this blog post is about. Hopefully, I will have time soon to actually blog about the Bayesian equivalent of A/B testing.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
>
> {-# LANGUAGE NoMonomorphismRestriction #-}
>
> module StudentTest (
> main
> ) where
>
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Random.Distribution.T
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
Let’s create a reasonable number of samples as the higher moments converge quite slowly.
> nSamples :: Int
> nSamples = 1000000
An arbitrary seed for creating the samples.
> arbSeed :: Int
> arbSeed = 8
Student’s t only has one parameter, the number of degrees of freedom.
> nu :: Integer
> nu = 6
Now we can do our tests by calculating the sampled values.
> ts :: [Double]
> ts =
> evalState (replicateM nSamples (sample (T nu)))
> (pureMT $ fromIntegral arbSeed)
> mean, variance, skewness, kurtosis :: Double
> mean = (sum ts) / fromIntegral nSamples
> variance = (sum (map (**2) ts)) / fromIntegral nSamples
> skewness = (sum (map (**3) ts) / fromIntegral nSamples) / variance**1.5
> kurtosis = (sum (map (**4) ts) / fromIntegral nSamples) / variance**2
This works fine for small sample sizes but not for the number we have chosen.
./StudentTest +RTS -hc
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
It seems a shame that sum in the Prelude has this behaviour, but never mind; let us ensure that we consume values strictly (they are being produced lazily).
> mean' = (foldl' (+) 0 ts) / fromIntegral nSamples
> variance' = (foldl' (+) 0 (map (**2) ts)) / fromIntegral nSamples
> skewness' = (foldl' (+) 0 (map (**3) ts) / fromIntegral nSamples) / variance'**1.5
> kurtosis' = (foldl' (+) 0 (map (**4) ts) / fromIntegral nSamples) / variance'**2
We now have a space leak on the heap, as the ghc profiler output below shows. What went wrong?
If we only calculate the mean using foldl’ then all is well. Instead of 35M we only use 45K.
Well that gives us a clue. The garbage collector cannot reclaim the samples as they are needed for the other calculations. What we need to do is calculate all the moments together, strictly, in a single pass.
Let’s create a strict record to do this.
> data Moments = Moments { m1 :: !Double
> , m2 :: !Double
> , m3 :: !Double
> , m4 :: !Double
> }
> deriving Show
And calculate the results strictly.
>
> m = foldl' (\m x -> Moments { m1 = m1 m + x
> , m2 = m2 m + x**2
> , m3 = m3 m + x**3
> , m4 = m4 m + x**4
> }) (Moments 0.0 0.0 0.0 0.0) ts
>
> mean'' = m1 m / fromIntegral nSamples
> variance'' = m2 m / fromIntegral nSamples
> skewness'' = (m3 m / fromIntegral nSamples) / variance''**1.5
> kurtosis'' = (m4 m / fromIntegral nSamples) / variance''**2
Now we have what we want; the program runs in small constant space.
> main :: IO ()
> main = do
> putStrLn $ show mean''
> putStrLn $ show variance''
> putStrLn $ show skewness''
> putStrLn $ show kurtosis''
Oh and the moments give the expected answers.
ghci> mean''
3.9298418844289093e-4
ghci> variance''
1.4962681916693004
ghci> skewness''
1.0113188204317015e-2
ghci> kurtosis''
5.661776268997382
To run this you will need my version of random-fu. The code for this article is here. You will need to compile everything with profiling, something like
ghc -O2 -main-is StudentTest StudentTest.lhs -prof
-package-db=.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d
Since you need all the packages to be built with profiling, you will probably want to build using a sandbox as above. The only slightly tricky aspect is building random-fu so it is in your sandbox.
runghc Setup.lhs configure --enable-library-profiling
--package-db=/HasBayes/.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d
--libdir=/HasBayes/.cabal-sandbox/lib
This is meant to be a shorter blog post than normal, with the expectation that the material will be developed further in future blog posts.
A Bayesian will have a prior view of the distribution of some data and then based on data, update that view. Mostly the updated distribution, the posterior, will not be expressible as an analytic function and sampling via Markov Chain Monte Carlo (MCMC) is the only way to determine it.
In some special cases, when the posterior is of the same family of distributions as the prior, then the posterior is available analytically and we call the posterior and prior conjugate. It turns out that the normal or Gaussian distribution is conjugate with respect to a normal likelihood distribution.
This gives us the opportunity to compare MCMC against the analytic solution and give ourselves more confidence that MCMC really does deliver the goods.
Some points of note:
Since we want to display the posterior (and the prior for that matter), for histograms we use the histogram-fill package.
Since we are using Monte Carlo we can use all the cores on our computer via one of Haskell’s parallelization mechanisms.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}
> module ConjMCMCSimple where
>
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import qualified Data.Histogram as H
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import Control.Parallel.Strategies
>
> import Diagrams.Backend.Cairo.CmdLine
>
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )
>
> import LinRegAux
Suppose the prior is $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$, that is

$$\pi(\mu) \propto \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$$

Our data is IID normal, $x_i \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is known, so the likelihood is

$$p(\boldsymbol{x} \mid \mu) \propto \prod_{i=1}^n \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

The assumption that $\sigma$ is known is unrealistic, but the point of this post is to demonstrate MCMC matching an analytic formula.

This gives a posterior of

$$\pi(\mu \mid \boldsymbol{x}) \propto \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}\right)$$

In other words

$$\mu \mid \boldsymbol{x} \sim \mathcal{N}(\mu_1, \sigma_1^2)$$

Writing

$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$$

we get

$$\sigma_1^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n\sigma_0^2}$$

Thus the precision (the inverse of the variance) of the posterior is the precision of the prior plus the precision of the data scaled by the number of observations. This gives a nice illustration of how Bayesian statistics improves our beliefs.

Writing

$$\lambda_0 = \frac{1}{\sigma_0^2}$$

and

$$\lambda = \frac{1}{\sigma^2}$$

we see that

$$\mu_1 = \frac{\lambda_0 \mu_0 + n\lambda\bar{x}}{\lambda_0 + n\lambda} = \frac{\sigma^2 \mu_0 + \sigma_0^2 \sum_{i=1}^n x_i}{\sigma^2 + n\sigma_0^2}$$

Thus the mean of the posterior is a weighted sum of the mean of the prior scaled by the precision of the prior and the sample mean scaled by the precision of the data, itself scaled by the number of observations.
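To make the precision update concrete, here is a small numeric sketch (not part of the original code) using the values chosen below, a prior standard deviation of 2.0, a data standard deviation of 1.0 and 10 samples:

```haskell
-- Posterior precision = prior precision + n * data precision.
posteriorPrecision :: Double -> Double -> Int -> Double
posteriorPrecision sigma0 sigma n =
  1 / sigma0 ^ 2 + fromIntegral n / sigma ^ 2

main :: IO ()
main = do
  let p = posteriorPrecision 2.0 1.0 10
  print p              -- 10.25
  print (sqrt (1 / p)) -- posterior standard deviation, about 0.31
```

Ten observations swamp the rather uncertain prior, which is exactly what the charts below show.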
Rather arbitrarily let us pick a prior mean of
> mu0 :: Double
> mu0 = 11.0
and express our uncertainty about it with a largish prior variance
> sigma_0 :: Double
> sigma_0 = 2.0
And also arbitrarily let us pick the known variance for the samples as
> sigma :: Double
> sigma = 1.0
We can sample from this in a way that looks very similar to STAN and JAGS:
> hierarchicalSample :: MonadRandom m => m Double
> hierarchicalSample = do
> mu <- sample (Normal mu0 sigma_0)
> x <- sample (Normal mu sigma)
> return x
and we didn’t need to write a new language for this.
Again arbitrarily let us take
> nSamples :: Int
> nSamples = 10
and use
> arbSeed :: Int
> arbSeed = 2
And then actually generate the samples.
> simpleXs :: [Double]
> simpleXs =
> evalState (replicateM nSamples hierarchicalSample)
> (pureMT $ fromIntegral arbSeed)
Using the formulae we did above we can calculate the posterior
> mu_1, sigma1, simpleNumerator :: Double
> simpleNumerator = fromIntegral nSamples * sigma_0**2 + sigma**2
> mu_1 = (sigma**2 * mu0 + sigma_0**2 * sum simpleXs) / simpleNumerator
> sigma1 = sigma**2 * sigma_0**2 / simpleNumerator
and then compare it against the prior
The red posterior shows that we are a lot more certain now that we have some evidence.
The theory behind MCMC is described in a previous post. We need to generate some proposed steps for the chain. We sample from the normal distribution, but we could have used e.g. the gamma.
> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
> evalState (replicateM nIters (sample (Normal 0.0 sigma)))
> (pureMT $ fromIntegral seed)
We also need samples from the uniform distribution
> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
> evalState (replicateM nIters (sample stdUniform))
> (pureMT $ fromIntegral seed)
And now we can calculate the (un-normalised) prior, likelihood and posterior
> prior :: Double -> Double
> prior mu = exp (-(mu - mu0)**2 / (2 * sigma_0**2))
>
> likelihood :: Double -> [Double] -> Double
> likelihood mu xs = exp (-sum (map (\x -> (x - mu)**2 / (2 * sigma**2)) xs))
>
> posterior :: Double -> [Double] -> Double
> posterior mu xs = likelihood mu xs * prior mu
The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.
> acceptanceProb :: Double -> Double -> [Double] -> Double
> acceptanceProb mu mu' xs = min 1.0 ((posterior mu' xs) / (posterior mu xs))
> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (mu, nAccs) (proposedJump, acceptOrReject) =
> if acceptOrReject < acceptanceProb mu (mu + proposedJump) simpleXs
> then (mu + proposedJump, nAccs + 1)
> else (mu, nAccs)
Now we can actually run our simulation. We set the number of jumps and a burn in but do not do any thinning.
> nIters, burnIn :: Int
> nIters = 300000
> burnIn = nIters `div` 10
Let us start our chain at
> startMu :: Double
> startMu = 10.0
and set the variance of the jumps to
> jumpVar :: Double
> jumpVar = 0.4
> test :: Int -> [(Double, Int)]
> test seed =
> drop burnIn $
> scanl oneStep (startMu, 0) $
> zip (normalisedProposals seed jumpVar nIters)
> (acceptOrRejects seed nIters)
We put the data into a histogram
> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
> where
> lower = startMu - 1.5*sigma_0
> upper = startMu + 1.5*sigma_0
>
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $ test seed)
Not bad but a bit lumpy. Let’s try a few runs and see if we can smooth things out.
> hists :: [Histogram V.Vector BinD Double]
> hists = parMap rpar hist [3,4..102]
> emptyHist :: Histogram V.Vector BinD Double
> emptyHist = fillBuilder hb (replicate numBins 0)
>
> smoothHist :: Histogram V.Vector BinD Double
> smoothHist = foldl' (H.zip (+)) emptyHist hists
Quite nice and had my machine running at 750% with +RTS -N8.
Let’s create the same histogram but from the posterior created analytically.
> analPosterior :: [Double]
> analPosterior =
> evalState (replicateM 100000 (sample (Normal mu_1 (sqrt sigma1))))
> (pureMT $ fromIntegral 5)
>
> histAnal :: Histogram V.Vector BinD Double
> histAnal = fillBuilder hb analPosterior
And then compare them. Because they overlap so well, we show the MCMC histogram, the analytic histogram and both together on separate charts.
Normally with BlogLiteratelyD, we can generate diagrams on the fly. However, here we want to run the simulations in parallel so we need to actually compile something.
ghc -O2 ConjMCMCSimple.lhs -main-is ConjMCMCSimple -threaded -fforce-recomp
> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
> mainRender ( DiagramOpts (Just 900) (Just 700) fn
> , DiagramLoopOpts False Nothing 0
> )
> main :: IO ()
> main = do
> displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/HistMCMC.png"
> (barDiag MCMC
> (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
> (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
>
> displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/HistMCMCAnal.png"
> (barDiag MCMCAnal
> (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
> (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
>
> displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/HistAnal.png"
> (barDiag Anal
> (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
> (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
>
> displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/SmoothHistMCMC.png"
> (barDiag MCMC
> (zip (map fst $ asList smoothHist) (map snd $ asList smoothHist))
> (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
Suppose we have a square thin plate of metal and we hold each of its edges at a temperature which may vary along the edge but is fixed for all time. After some period, depending on the conductivity of the metal, the temperature at every point on the plate will have stabilised. What is the temperature at any given point?
We can calculate this by solving Laplace’s equation in 2 dimensions. Apart from the preceding motivation, a more compelling reason for doing so is that it is a moderately simple equation, in so far as partial differential equations are simple, that has been well studied for centuries.
In Haskell terms this gives us the opportunity to use the repa library and to use hmatrix, which is based on LAPACK (as well as other libraries), albeit hmatrix only for illustrative purposes.
I had originally intended this blog to contain a comparison of repa’s performance against an equivalent C program, even though this has already been undertaken by the repa team in their various publications. And indeed it is still my intention to produce such a comparison. However, as I investigated further, it turned out a fair amount of comparison work has already been done by a team from Intel, which suggests there is currently a performance gap, but one which is not so large that it outweighs the other benefits of Haskell.
To be more specific, one way in which using repa stands out from the equivalent C implementation is that it gives a language in which we can specify the stencil being used to solve the equation. As an illustration we substitute the nine point method for the five point method merely by changing the stencil.
Fourier’s law states that the rate of heat transfer, or the flux $\boldsymbol{q}$, is proportional to the negative temperature gradient, as heat flows from hot to cold, and further that it flows in the direction of greatest temperature change. We can write this as

$$\boldsymbol{q} = -k\nabla u$$

where $u$ is the temperature at any given point on the plate and $k$ is the conductivity of the metal.

Moreover, we know that for any region on the plate, the total amount of heat flowing in must be balanced by the amount of heat flowing out. We can write this as

$$\nabla \cdot \boldsymbol{q} = 0$$

Substituting the first equation into the second we obtain Laplace’s equation

$$\nabla^2 u = 0$$
For example, suppose we hold the temperature of the edges of the plate as follows
then after some time the temperature of the plate will be as shown in the heatmap below.
Notes:
Red is hot.
Blue is cold.
The heatmap is created by a finite difference method described below.
The $y$-axis points down (not up), i.e. the largest value of $y$ is at the bottom, reflecting the fact that we are using an array in the finite difference method and rows go down not up.
The corners are grey because in the five point finite difference method these play no part in determining temperatures in the interior of the plate.
Since the book I am writing contains C code (for performance comparisons), I need a way of being able to compile and run this code and include it “as is” in the book. Up until now, all my blog posts have contained Haskell and so I have been able to use BlogLiteratelyD which allows me to include really nice diagrams. But clearly this tool wasn’t really designed to handle other languages (although I am sure it could be made to do so).
Using pandoc’s scripting capability with the small script provided
#!/usr/bin/env runhaskell
import Text.Pandoc.JSON
doInclude :: Block -> IO Block
doInclude cb@(CodeBlock ("verbatim", classes, namevals) contents) =
case lookup "include" namevals of
Just f -> return . (\x -> Para [Math DisplayMath x]) =<< readFile f
Nothing -> return cb
doInclude cb@(CodeBlock (id, classes, namevals) contents) =
case lookup "include" namevals of
Just f -> return . (CodeBlock (id, classes, namevals)) =<< readFile f
Nothing -> return cb
doInclude x = return x
main :: IO ()
main = toJSONFilter doInclude
I can then include C code blocks like this
~~~~ {.c include="Chap1a.c"}
~~~~
And format the whole document like this
pandoc -s Chap1.lhs --filter=./Include -t markdown+lhs > Chap1Expanded.lhs
BlogLiteratelyD Chap1Expanded.lhs > Chap1.html
Sadly, the C doesn’t get syntax highlighting but this will do for now.
PS Sadly, WordPress doesn’t seem to be able to handle \color{red} and \color{blue} in LaTeX so there are some references to blue and red which do not render.
A lot of the code for this post is taken from the repa package itself. Many thanks to the repa team for providing the package and the example code.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE BangPatterns #-}
> {-# LANGUAGE TemplateHaskell #-}
> {-# LANGUAGE QuasiQuotes #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}
> module Chap1 (
> module Control.Applicative
> , solveLaplaceStencil
> , useBool
> , boundMask
> , boundValue
> , bndFnEg1
> , fivePoint
> , ninePoint
> , testStencil5
> , testStencil9
> , analyticValue
> , slnHMat4
> , slnHMat5
> , testJacobi4
> , testJacobi6
> , bndFnEg3
> , runSolver
> , s5
> , s9
> ) where
>
> import Data.Array.Repa as R
> import Data.Array.Repa.Unsafe as R
> import Data.Array.Repa.Stencil as A
> import Data.Array.Repa.Stencil.Dim2 as A
> import Prelude as P
> import Data.Packed.Matrix
> import Numeric.LinearAlgebra.Algorithms
> import Chap1Aux
> import Control.Applicative
We show how to apply finite difference methods to Laplace’s equation:

$$\nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0$$

where the value of $u$ is specified on the boundary of the region of interest.

For a sufficiently smooth function (see (Iserles 2009, chap. 8)) we have

$$\frac{\partial^2 u}{\partial x^2} = \frac{\Delta_0^2 u}{h^2} + O(h^2)$$

where $h$ is the grid spacing and the central difference operator $\Delta_0$ is defined as

$$(\Delta_0 u)(x) \triangleq u\left(x + \frac{h}{2}\right) - u\left(x - \frac{h}{2}\right)$$

The analogous formula holds in the $y$ direction. We are therefore led to consider the five point difference scheme

$$u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j} = 0$$

We can re-write this explicitly as

$$u_{i,j} = \frac{1}{4}\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1}\right)$$

Specifically for the grid point $(2,1)$ in the grid we have
where blue indicates that the point is an interior point and red indicates that the point is a boundary point. For Dirichlet boundary conditions (which is all we consider in this post), the values at the boundary points are known.
We can write the entire set of equations for this grid as
Let us take the boundary conditions to be
With our grid we can solve this exactly using the hmatrix package which has a binding to LAPACK.
First we create a matrix in hmatrix form
> simpleEgN :: Int
> simpleEgN = 4 - 1
>
> matHMat4 :: IO (Matrix Double)
> matHMat4 = do
> matRepa <- computeP $ mkJacobiMat simpleEgN :: IO (Array U DIM2 Double)
> return $ (simpleEgN - 1) >< (simpleEgN - 1) $ toList matRepa
ghci> matHMat4
(2><2)
[ -4.0, 1.0
, 1.0, 0.0 ]
Next we create the column vector prescribed by the boundary conditions
> bndFnEg1 :: Int -> Int -> (Int, Int) -> Double
> bndFnEg1 _ m (0, j) | j > 0 && j < m = 1.0
> bndFnEg1 n m (i, j) | i == n && j > 0 && j < m = 2.0
> bndFnEg1 n _ (i, 0) | i > 0 && i < n = 1.0
> bndFnEg1 n m (i, j) | j == m && i > 0 && i < n = 2.0
> bndFnEg1 _ _ _ = 0.0
> bnd1 :: Int -> [(Int, Int)] -> Double
> bnd1 n = negate .
> sum .
> P.map (bndFnEg1 n n)
> bndHMat4 :: Matrix Double
> bndHMat4 = ((simpleEgN - 1) * (simpleEgN - 1)) >< 1 $
> mkJacobiBnd fromIntegral bnd1 3
ghci> bndHMat4
(4><1)
[ -2.0
, -3.0
, -3.0
, -4.0 ]
> slnHMat4 :: IO (Matrix Double)
> slnHMat4 = matHMat4 >>= return . flip linearSolve bndHMat4
ghci> slnHMat4
(4><1)
[ 1.25
, 1.5
, 1.4999999999999998
, 1.7499999999999998 ]
Inverting a matrix is expensive, so instead we use the (possibly most) classical of all iterative methods, Jacobi iteration. Given $A\boldsymbol{x} = \boldsymbol{b}$ and an estimated solution $\boldsymbol{x}^{(k)}$, we can generate an improved estimate $\boldsymbol{x}^{(k+1)}$. See (Iserles 2009, chap. 12) for the details on convergence and convergence rates.
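A minimal sketch of Jacobi iteration on a small diagonally dominant system (the $2 \times 2$ system here is illustrative only, not the Laplace matrix from this post):

```haskell
-- Jacobi iteration for the system
--   4x +  y = 1
--    x + 3y = 2
-- Each step solves row i for its diagonal unknown using the
-- previous estimates of the other unknowns.
jacobiStep :: (Double, Double) -> (Double, Double)
jacobiStep (x, y) = ((1 - y) / 4, (2 - x) / 3)

main :: IO ()
main = do
  let (x, y) = iterate jacobiStep (0, 0) !! 50
  -- The exact solution is (1/11, 7/11).
  print (abs (x - 1/11) < 1e-9 && abs (y - 7/11) < 1e-9)
```

Because the matrix is diagonally dominant the iteration contracts the error at each step, which is why 50 iterations are far more than enough here.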
The simple example above does not really give a clear picture of what happens in general during the update of the estimate. Here is a larger example
Sadly, WordPress does not seem to be able to render matrices written in LaTeX so you will have to look at the output from hmatrix in the larger example below. You can see that this matrix is sparse and has a very clear pattern.
Expanding the matrix equation for a grid point not on the boundary we get

$$u_{i,j}^{(k+1)} = \frac{1}{4}\left(u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)}\right)$$

Clearly the values of the points on the boundary are fixed and must remain at those values for every iteration.
Here is the method using repa. To produce an improved estimate, we define a function relaxLaplace and we pass in a repa matrix representing our original estimate and receive the one step update also as a repa matrix.
We pass in a boundary condition mask which specifies which points are boundary points; a point is a boundary point if its value is 1.0 and not if its value is 0.0.
> boundMask :: Monad m => Int -> Int -> m (Array U DIM2 Double)
> boundMask gridSizeX gridSizeY = computeP $
> fromFunction (Z :. gridSizeX + 1 :. gridSizeY + 1) f
> where
> f (Z :. _ix :. iy) | iy == 0 = 0
> f (Z :. _ix :. iy) | iy == gridSizeY = 0
> f (Z :. ix :. _iy) | ix == 0 = 0
> f (Z :. ix :. _iy) | ix == gridSizeX = 0
> f _ = 1
Better would be to use at least a Bool as the example below shows but we wish to modify the code from the repa git repo as little as possible.
> useBool :: IO (Array U DIM1 Double)
> useBool = computeP $
> R.map (fromIntegral . fromEnum) $
> fromFunction (Z :. (3 :: Int)) (const True)
ghci> useBool
AUnboxed (Z :. 3) (fromList [1.0,1.0,1.0])
We further pass in the boundary conditions. We construct these by using a function which takes the grid size in the $x$ direction, the grid size in the $y$ direction and a given pair of co-ordinates in the grid, and returns a value at this position.
> boundValue :: Monad m =>
> Int ->
> Int ->
> (Int -> Int -> (Int, Int) -> Double) ->
> m (Array U DIM2 Double)
> boundValue gridSizeX gridSizeY bndFn =
> computeP $
> fromFunction (Z :. gridSizeX + 1 :. gridSizeY + 1) g
> where
> g (Z :. ix :. iy) = bndFn gridSizeX gridSizeY (ix, iy)
Note that we only update an element in the repa matrix representation of the vector if it is not on the boundary.
> relaxLaplace
> :: Monad m
> => Array U DIM2 Double
> -> Array U DIM2 Double
> -> Array U DIM2 Double
> -> m (Array U DIM2 Double)
>
> relaxLaplace arrBoundMask arrBoundValue arr
> = computeP
> $ R.zipWith (+) arrBoundValue
> $ R.zipWith (*) arrBoundMask
> $ unsafeTraverse arr id elemFn
> where
> _ :. height :. width
> = extent arr
>
> elemFn !get !d@(sh :. i :. j)
> = if isBorder i j
> then get d
> else (get (sh :. (i-1) :. j)
> + get (sh :. i :. (j-1))
> + get (sh :. (i+1) :. j)
> + get (sh :. i :. (j+1))) / 4
> isBorder !i !j
> = (i == 0) || (i >= width - 1)
> || (j == 0) || (j >= height - 1)
We can use this to iterate as many times as we like.
> solveLaplace
> :: Monad m
> => Int
> -> Array U DIM2 Double
> -> Array U DIM2 Double
> -> Array U DIM2 Double
> -> m (Array U DIM2 Double)
>
> solveLaplace steps arrBoundMask arrBoundValue arrInit
> = go steps arrInit
> where
> go !i !arr
> | i == 0
> = return arr
>
> | otherwise
> = do arr' <- relaxLaplace arrBoundMask arrBoundValue arr
> go (i - 1) arr'
For our small example, we set the initial array to 0.0 at every point. Note that the function which updates the grid, relaxLaplace, will immediately over-write the points on the boundary with values given by the boundary condition.
> mkInitArrM :: Monad m => Int -> m (Array U DIM2 Double)
> mkInitArrM n = computeP $ fromFunction (Z :. (n + 1) :. (n + 1)) (const 0.0)
We can now test the Jacobi method.
> testJacobi4 :: Int -> IO (Array U DIM2 Double)
> testJacobi4 nIter = do
> mask <- boundMask simpleEgN simpleEgN
> val <- boundValue simpleEgN simpleEgN bndFnEg1
> initArr <- mkInitArrM simpleEgN
> solveLaplace nIter mask val initArr
After 55 iterations, we obtain convergence up to the limit of accuracy of double precision floating point numbers. Note this only provides a solution of the matrix equation which is an approximation to Laplace’s equation. To obtain a more accurate result for the latter we need to use a smaller grid size.
ghci> testJacobi4 55 >>= return . pPrint
[0.0, 1.0, 1.0, 0.0]
[1.0, 1.25, 1.5, 2.0]
[1.0, 1.5, 1.75, 2.0]
[0.0, 2.0, 2.0, 0.0]
Armed with Jacobi, let us now solve a large example.
> largerEgN, largerEgN2 :: Int
> largerEgN = 6 - 1
> largerEgN2 = (largerEgN - 1) * (largerEgN - 1)
First let us use hmatrix.
> matHMat5 :: IO (Matrix Double)
> matHMat5 = do
> matRepa <- computeP $ mkJacobiMat largerEgN :: IO (Array U DIM2 Double)
> return $ largerEgN2 >< largerEgN2 $ toList matRepa
ghci> matHMat5
(16><16)
[ -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 1.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 0.0, 1.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 0.0, 0.0, 1.0, -4.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 1.0, 0.0, 0.0, 0.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0
, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 1.0, 0.0, 0.0, 1.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 0.0, 0.0, 0.0, 1.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, -4.0, 1.0, 0.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 1.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0, 1.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, -4.0 ]
> bndHMat5 :: Matrix Double
> bndHMat5 = largerEgN2 >< 1 $ mkJacobiBnd fromIntegral bnd1 5
ghci> bndHMat5
(16><1)
[ -2.0
, -1.0
, -1.0
, -3.0
, -1.0
, 0.0
, 0.0
, -2.0
, -1.0
, 0.0
, 0.0
, -2.0
, -3.0
, -2.0
, -2.0
, -4.0 ]
> slnHMat5 :: IO (Matrix Double)
> slnHMat5 = matHMat5 >>= return . flip linearSolve bndHMat5
ghci> slnHMat5
(16><1)
[ 1.0909090909090908
, 1.1818181818181817
, 1.2954545454545454
, 1.5
, 1.1818181818181817
, 1.3409090909090906
, 1.4999999999999996
, 1.7045454545454544
, 1.2954545454545459
, 1.5
, 1.6590909090909092
, 1.818181818181818
, 1.5000000000000004
, 1.7045454545454548
, 1.8181818181818186
, 1.9090909090909092 ]
And for comparison, let us use the Jacobi method.
> testJacobi6 :: Int -> IO (Array U DIM2 Double)
> testJacobi6 nIter = do
> mask <- boundMask largerEgN largerEgN
> val <- boundValue largerEgN largerEgN bndFnEg1
> initArr <- mkInitArrM largerEgN
> solveLaplace nIter mask val initArr
ghci> testJacobi6 178 >>= return . pPrint
[0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[1.0, 1.0909090909090908, 1.1818181818181817, 1.2954545454545454, 1.5, 2.0]
[1.0, 1.1818181818181817, 1.3409090909090908, 1.5, 1.7045454545454546, 2.0]
[1.0, 1.2954545454545454, 1.5, 1.6590909090909092, 1.8181818181818183, 2.0]
[1.0, 1.5, 1.7045454545454546, 1.8181818181818181, 1.9090909090909092, 2.0]
[0.0, 2.0, 2.0, 2.0, 2.0, 0.0]
Note that with a larger grid we need more iterations (178) before the Jacobi method converges.
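The number of sweeps needed can be predicted with a standard back-of-the-envelope estimate (an assumption, not something derived in this post): for the discrete Laplacian on a grid with n subdivisions per side, each Jacobi sweep shrinks the error by a factor of roughly cos(π/n).

```haskell
-- Sweeps needed for the error to fall below double precision (~1e-16)
-- if each Jacobi sweep shrinks it by a factor of cos(pi / n).
itersNeeded :: Int -> Int
itersNeeded n = ceiling (log 1e-16 / log (cos (pi / fromIntegral n)))
```

With these numbers itersNeeded 3 and itersNeeded 5 come out in the mid fifties and the mid one-hundred-and-seventies respectively, consistent with the 55 and 178 sweeps observed for the two grids above.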
Since we are functional programmers, our natural inclination is to see if we can find an abstraction for (at least some) numerical methods. We notice that we are updating each grid element (except the boundary elements) by taking the North, East, South and West surrounding squares and calculating a linear combination of these.
Repa provides this abstraction and we can describe the update calculation as a stencil. (Lippmeier and Keller 2011) gives full details of stencils in repa.
> fivePoint :: Stencil DIM2 Double
> fivePoint = [stencil2| 0 1 0
> 1 0 1
> 0 1 0 |]
Using stencils allows us to modify our numerical method with a very simple change. For example, suppose we wish to use the nine point method (which has a higher order of accuracy); then we only need write down the stencil for it, which additionally takes a linear combination of the North West, North East, South East and South West squares.
> ninePoint :: Stencil DIM2 Double
> ninePoint = [stencil2| 1 4 1
> 4 0 4
> 1 4 1 |]
We modify our solver above to take a stencil and also an Int which is used to normalise the factors in the stencil. For example, in the five point method this is 4.
> solveLaplaceStencil :: Monad m
> => Int
> -> Stencil DIM2 Double
> -> Int
> -> Array U DIM2 Double
> -> Array U DIM2 Double
> -> Array U DIM2 Double
> -> m (Array U DIM2 Double)
> solveLaplaceStencil !steps !st !nF !arrBoundMask !arrBoundValue !arrInit
> = go steps arrInit
> where
> go 0 !arr = return arr
> go n !arr
> = do arr' <- relaxLaplace arr
> go (n - 1) arr'
>
> relaxLaplace arr
> = computeP
> $ R.szipWith (+) arrBoundValue
> $ R.szipWith (*) arrBoundMask
> $ R.smap (/ (fromIntegral nF))
> $ mapStencil2 (BoundConst 0)
> st arr
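As a cross-check of what mapStencil2 is doing, here is a hypothetical plain-list analogue (applyStencil3x3 is our own name, not a repa function): it convolves a 3 × 3 weight matrix over the grid, treating out-of-range neighbours as 0 just like BoundConst 0, and divides by the normalising factor.

```haskell
-- Convolve a 3x3 weight matrix over an n x n list-of-lists grid,
-- treating out-of-range neighbours as 0 (cf. BoundConst 0), and
-- divide every point by a normalising factor.
applyStencil3x3 :: [[Double]] -> Double -> [[Double]] -> [[Double]]
applyStencil3x3 w nF g =
  [ [ point i j / nF | j <- [0 .. n - 1] ] | i <- [0 .. n - 1] ]
  where
    n = length g
    point i j = sum [ (w !! (di + 1)) !! (dj + 1) * at (i + di) (j + dj)
                    | di <- [-1, 0, 1], dj <- [-1, 0, 1] ]
    at i j | i < 0 || j < 0 || i >= n || j >= n = 0  -- BoundConst 0
           | otherwise = (g !! i) !! j
```

Applying the five point weights [[0,1,0],[1,0,1],[0,1,0]] with normalising factor 4 to the small grid from the 4 × 4 example gives 0.5 at position (1, 1), the average of its four neighbours, as expected.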
We can then test both methods.
> testStencil5 :: Int -> Int -> IO (Array U DIM2 Double)
> testStencil5 gridSize nIter = do
> mask <- boundMask gridSize gridSize
> val <- boundValue gridSize gridSize bndFnEg1
> initArr <- mkInitArrM gridSize
> solveLaplaceStencil nIter fivePoint 4 mask val initArr
ghci> testStencil5 5 178 >>= return . pPrint
[0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[1.0, 1.0909090909090908, 1.1818181818181817, 1.2954545454545454, 1.5, 2.0]
[1.0, 1.1818181818181817, 1.3409090909090908, 1.5, 1.7045454545454546, 2.0]
[1.0, 1.2954545454545454, 1.5, 1.6590909090909092, 1.8181818181818183, 2.0]
[1.0, 1.5, 1.7045454545454546, 1.8181818181818181, 1.9090909090909092, 2.0]
[0.0, 2.0, 2.0, 2.0, 2.0, 0.0]
> testStencil9 :: Int -> Int -> IO (Array U DIM2 Double)
> testStencil9 gridSize nIter = do
> mask <- boundMask gridSize gridSize
> val <- boundValue gridSize gridSize bndFnEg1
> initArr <- mkInitArrM gridSize
> solveLaplaceStencil nIter ninePoint 20 mask val initArr
ghci> testStencil9 5 178 >>= return . pPrint
[0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[1.0, 1.0222650172207302, 1.1436086139049304, 1.2495750646811328, 1.4069077172153264, 2.0]
[1.0, 1.1436086139049304, 1.2964314331751594, 1.4554776038855908, 1.6710941204241017, 2.0]
[1.0, 1.2495750646811328, 1.455477603885591, 1.614523774596022, 1.777060571200304, 2.0]
[1.0, 1.4069077172153264, 1.671094120424102, 1.777060571200304, 1.7915504172099226, 2.0]
[0.0, 2.0, 2.0, 2.0, 2.0, 0.0]
We note that the methods give different answers. Before explaining this, let us examine one more example where the exact solution is known.
We take the example from (Iserles 2009, chap. 8) where the boundary conditions on the unit square are u(0, y) = y / (1 + y²), u(1, y) = y / (4 + y²), u(x, 0) = 0 and u(x, 1) = 1 / ((1 + x)² + 1).
This has the exact solution u(x, y) = y / ((1 + x)² + y²).
And we can calculate the values of this function on a grid.
> analyticValue :: Monad m => Int -> m (Array U DIM2 Double)
> analyticValue gridSize = computeP $ fromFunction (Z :. gridSize + 1 :. gridSize + 1) f
> where
> f (Z :. ix :. iy) = y / ((1 + x)^2 + y^2)
> where
> y = fromIntegral iy / fromIntegral gridSize
> x = fromIntegral ix / fromIntegral gridSize
Let us also solve it using the Jacobi method with a five point stencil and a nine point stencil. Here is the encoding of the boundary values.
> bndFnEg3 :: Int -> Int -> (Int, Int) -> Double
> bndFnEg3 _ m (0, j) | j >= 0 && j < m = y / (1 + y^2)
> where y = (fromIntegral j) / (fromIntegral m)
> bndFnEg3 n m (i, j) | i == n && j > 0 && j <= m = y / (4 + y^2)
> where y = fromIntegral j / fromIntegral m
> bndFnEg3 n _ (i, 0) | i > 0 && i <= n = 0.0
> bndFnEg3 n m (i, j) | j == m && i >= 0 && i < n = 1 / ((1 + x)^2 + 1)
> where x = fromIntegral i / fromIntegral n
> bndFnEg3 _ _ _ = 0.0
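As a quick sanity check (a sketch; exactSoln and the edge functions are our own hypothetical names), restricting the analytic solution to the four edges of the unit square should reproduce exactly the four pieces encoded in bndFnEg3.

```haskell
-- The analytic solution of this example on the unit square.
exactSoln :: Double -> Double -> Double
exactSoln x y = y / ((1 + x) ^ 2 + y ^ 2)

-- Its restriction to three of the edges (the fourth, y = 0, is 0),
-- matching the pieces used in bndFnEg3.
leftEdge, rightEdge, topEdge :: Double -> Double
leftEdge  y = y / (1 + y ^ 2)          -- x = 0
rightEdge y = y / (4 + y ^ 2)          -- x = 1
topEdge   x = 1 / ((1 + x) ^ 2 + 1)    -- y = 1
```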
We create a function to run a solver.
> runSolver ::
> Monad m =>
> Int ->
> Int ->
> (Int -> Int -> (Int, Int) -> Double) ->
> (Int ->
> Array U DIM2 Double ->
> Array U DIM2 Double ->
> Array U DIM2 Double ->
> m (Array U DIM2 Double)) ->
> m (Array U DIM2 Double)
> runSolver nGrid nIter boundaryFn solver = do
> mask <- boundMask nGrid nGrid
> val <- boundValue nGrid nGrid boundaryFn
> initArr <- mkInitArrM nGrid
> solver nIter mask val initArr
And put the five point and nine point solvers in the appropriate form.
> s5, s9 :: Monad m =>
> Int ->
> Array U DIM2 Double ->
> Array U DIM2 Double ->
> Array U DIM2 Double ->
> m (Array U DIM2 Double)
> s5 n = solveLaplaceStencil n fivePoint 4
> s9 n = solveLaplaceStencil n ninePoint 20
And now we can see that the errors between the analytic solution and the five point method with a grid size of 8 are of the order of 10⁻⁴.
ghci> liftA2 (-^) (analyticValue 7) (runSolver 7 200 bndFnEg3 s5) >>= return . pPrint
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, -3.659746856576884e-4, -5.792613003869074e-4, -5.919333582729558e-4, -4.617020226472812e-4, -2.7983716661839075e-4, -1.1394184484148084e-4, 0.0]
[0.0, -4.0566163490589335e-4, -6.681826442424543e-4, -7.270498771604073e-4, -6.163531890425178e-4, -4.157604876017795e-4, -1.9717865146007263e-4, 0.0]
[0.0, -3.4678314565880775e-4, -5.873627029994999e-4, -6.676042377350699e-4, -5.987527967581119e-4, -4.318102416048242e-4, -2.2116263241278578e-4, 0.0]
[0.0, -2.635436147627873e-4, -4.55055831294085e-4, -5.329636937312088e-4, -4.965786933938399e-4, -3.7401874422060555e-4, -2.0043638973538114e-4, 0.0]
[0.0, -1.7773949138776696e-4, -3.1086347862371855e-4, -3.714478154303591e-4, -3.5502855035249303e-4, -2.7528200465845587e-4, -1.5207424182367424e-4, 0.0]
[0.0, -9.188482657347674e-5, -1.6196970595228066e-4, -1.9595925291693295e-4, -1.903987061394885e-4, -1.5064155667735002e-4, -8.533752030373543e-5, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
But using the nine point method significantly improves this.
ghci> liftA2 (-^) (analyticValue 7) (runSolver 7 200 bndFnEg3 s9) >>= return . pPrint
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, -2.7700522166329566e-7, -2.536751151638317e-7, -5.5431452705700934e-8, 7.393573120406671e-8, 8.403487600228132e-8, 4.188249685954659e-8, 0.0]
[0.0, -2.0141002235463112e-7, -2.214645128950643e-7, -9.753369634157849e-8, 2.1887763435035623e-8, 6.305346988977334e-8, 4.3482495659663556e-8, 0.0]
[0.0, -1.207601019737048e-7, -1.502713803391842e-7, -9.16850228516175e-8, -1.4654435886995998e-8, 2.732932558036083e-8, 2.6830928867571657e-8, 0.0]
[0.0, -6.883445567013036e-8, -9.337114890983766e-8, -6.911451747027009e-8, -2.6104150896433254e-8, 4.667329939200826e-9, 1.1717137371469732e-8, 0.0]
[0.0, -3.737430460254432e-8, -5.374955715231611e-8, -4.483740087546373e-8, -2.299792309368165e-8, -4.122571728437663e-9, 3.330287268177301e-9, 0.0]
[0.0, -1.6802381437586167e-8, -2.5009212159532446e-8, -2.229028683853329e-8, -1.3101905282919546e-8, -4.1197137368165215e-9, 3.909041701444238e-10, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Iserles, A. 2009. A First Course in the Numerical Analysis of Differential Equations. A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press. http://books.google.co.uk/books?id=M0tkw4oUucoC.
Lippmeier, Ben, and Gabriele Keller. 2011. “Efficient Parallel Stencil Convolution in Haskell.” In Proceedings of the 4th ACM Symposium on Haskell, 59–70. Haskell ’11. New York, NY, USA: ACM. doi:10.1145/2034675.2034684. http://doi.acm.org/10.1145/2034675.2034684.
I’ve used knitr to produce this post rather than my usual BlogLiteratelyD.
For example, let us plot an index.
First we load the quantmod library
library(quantmod)
We can chart the S&P 500 for 2013.
GSPC <- getSymbols("^GSPC", src = "yahoo", auto.assign = FALSE)
dim(GSPC)
## [1] 1768 6
head(GSPC, 4)
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2007-01-03 1418 1429 1408 1417 3.429e+09
## 2007-01-04 1417 1422 1408 1418 3.004e+09
## 2007-01-05 1418 1418 1406 1410 2.919e+09
## 2007-01-08 1409 1415 1404 1413 2.763e+09
## GSPC.Adjusted
## 2007-01-03 1417
## 2007-01-04 1418
## 2007-01-05 1410
## 2007-01-08 1413
tail(GSPC, 4)
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2014-01-06 1832 1837 1824 1827 3.295e+09
## 2014-01-07 1829 1840 1829 1838 3.512e+09
## 2014-01-08 1838 1840 1831 1837 3.652e+09
## 2014-01-09 1839 1843 1830 1838 3.581e+09
## GSPC.Adjusted
## 2014-01-06 1827
## 2014-01-07 1838
## 2014-01-08 1837
## 2014-01-09 1838
chartSeries(GSPC, subset = "2013", theme = "white")
We can also chart currencies e.g. the Rupee / US Dollar exchange rate.
INRUSD <- getSymbols("INR=X", src = "yahoo", auto.assign = FALSE)
dim(INRUSD)
## [1] 1805 6
head(INRUSD, 4)
## INR=X.Open INR=X.High INR=X.Low INR=X.Close INR=X.Volume
## 2007-01-01 44.22 44.22 44.04 44.22 0
## 2007-01-02 44.21 44.22 44.08 44.12 0
## 2007-01-03 44.12 44.41 44.09 44.11 0
## 2007-01-04 44.12 44.48 44.10 44.10 0
## INR=X.Adjusted
## 2007-01-01 44.22
## 2007-01-02 44.12
## 2007-01-03 44.11
## 2007-01-04 44.10
tail(INRUSD, 4)
## INR=X.Open INR=X.High INR=X.Low INR=X.Close INR=X.Volume
## 2014-01-01 61.84 61.97 61.80 61.80 0
## 2014-01-02 61.84 62.41 61.74 61.84 0
## 2014-01-03 62.06 62.57 62.06 62.06 0
## 2014-01-06 62.23 62.45 61.94 62.23 0
## INR=X.Adjusted
## 2014-01-01 61.80
## 2014-01-02 61.84
## 2014-01-03 62.06
## 2014-01-06 62.23
chartSeries(INRUSD, subset = "2013", theme = "white")
The idea is that two continuous functions are homotopic if they can be deformed continuously into each other. All functions from now on are considered continuous unless otherwise specified.
More formally, if f, g : X → Y are homotopic then there exists a continuous function H : X × [0, 1] → Y such that H(x, 0) = f(x) and H(x, 1) = g(x).
A path is a map γ : I → X where I is the unit interval [0, 1]. A homotopy of paths between γ₀ and γ₁ is a function F : I × I → X such that F(s, 0) = γ₀(s) and F(s, 1) = γ₁(s), with the endpoints held fixed: F(0, t) = γ₀(0) = γ₁(0) and F(1, t) = γ₀(1) = γ₁(1) for all t. The two paths γ₀ and γ₁ are then homotopic and we write γ₀ ≃ γ₁. Homotopy of paths is an equivalence relation, a fact which we do not prove here. The equivalence class of a path γ is denoted [γ].
Where the end point of one path is the start point of another path, we can join these together to form new paths. This joining operation respects homotopy classes as the diagram below illustrates.
A loop is a path with γ(0) = γ(1).
We can join loops based at the same point together to form new loops, giving a group operation on equivalence classes under homotopy. In more detail, suppose that α and β are loops; then we can form the new loop α • β which is the obvious loop formed by first traversing α and then traversing β. The inverse of the loop α is the loop t ↦ α(1 − t) and the identity is the constant loop at the base point. The way we have defined homotopy ensures that this is all well defined (see either of the on-line books for the details).
The group of loops under these group operations is called the fundamental group or the first homotopy group and is denoted π₁(X, x₀).
Theorem π₁(S¹) ≅ ℤ where the latter denotes the integers with the group structure being given by addition.
Proof Define
by
that is for each integer we define an equivalence class of loops. In more standard notation (not using as a way of introducing anonymous functions)
where
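Written out, the map in question (the standard choice, as found in e.g. Hatcher) is:

```latex
\varphi : \mathbb{Z} \to \pi_1(S^1), \qquad
\varphi(n) = [\omega_n]
\quad\text{where}\quad
\omega_n(s) = \left(\cos 2\pi n s,\; \sin 2\pi n s\right)
```

that is, φ(n) is the class of the loop that winds n times around the circle.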
Further we have and clearly . Thus making a homomorphism.
We wish to show that is an isomorphism. To do this we are going to take a loop on and lift it to a path on so that the map which wraps around the circle projects this lifted map back to the original map: . Since we must have that for some integer (this will be the number of times the loop winds around the origin). Let us call the map that creates this integer from a loop on the circle
Note that we also have so the lifted path could start at any integer. For the sake of definiteness let us start it at .
We now construct this map, , show that it is well defined and is a homomorphism.
We know that is a smooth manifold. Let us give it an atlas.
We define the co-ordinate maps as . The transition maps are then just a translation around the circle. For example
And these are clearly smooth.
By continuity of , every point in is contained in some open interval whose map under is contained in one of the charts. This gives an open cover of . By the compactness of (it is closed and bounded), we therefore have a finite subcover. From this we can construct such that for some and .
Since we could define . But how do we define beyond ?
We know that
where is a homeomorphism from each to .
So we can equivalently define
where has been chosen to contain .
Given our specific atlas we have
Suppose we have the loop , then we could define
So that
And now we know how to continue. There must be a such that .
So we can define
where has been chosen to contain .
Given our specific atlas we have
With our specific example we could define
So that (again)
Continuing in this fashion, after a finite number of steps we have defined on the entirety of . Note that this construction gives a unique path as at each point the value of on is uniquely determined by its value at (and of course by itself).
We still have a problem that if is homotopic to then might not equal .
Assume we have a homotopy then since is compact we can proceed as above and choose a partition of of rectangles
with , , and such that for some .
Thus we can define on as
where has been chosen to contain .
We can continue as before; we know there must be a such that so we can define
where has been chosen to contain for all using the previously defined .
Eventually we will have defined on the whole of . But now we can start the same process for and by choosing to be the value of the previously defined and by the uniqueness of lifts of paths we can define on the whole of .
Carrying on in this way we will have defined on the whole of .
and must be constant paths because for all (constant lifted paths get projected onto these and they are unique).
Since we require that , we must have that .
By the uniqueness of lifted paths we must then have
Since is constant we must have that . Thus is well defined on homotopy classes.
Recall that so that so that is a homomorphism.
Suppose that then with a homotopy of loops. This lifts to a unique homotopy of paths with and . Since is a homotopy of paths, the end point is fixed. Thus and is injective.
Suppose that is a loop starting at 1 then there is a lift starting at . We also have that by the linear homotopy . Thus and is surjective.
Hatcher, A. 2002. Algebraic Topology. Tsinghua University. http://books.google.co.uk/books?id=xsIiEhRfwuIC.
May, J.P. 1999. A Concise Course in Algebraic Topology. Chicago Lectures in Mathematics. University of Chicago Press. http://books.google.co.uk/books?id=g8SG03R1bpgC.
About a year ago there was a reddit post on the Ising Model in Haskell. The discussion seems to have fizzled out but Ising models looked like a perfect fit for Haskell using repa. In the end it turns out that they are not a good fit for repa, at least not using the original formulation. It may turn out that we can do better with Swendson-Yang or Wolff. But that belongs to another blog post.
We can get some parallelism at a gross level using the Haskell parallel package via a one line change to the sequential code. However, this does not really show off Haskell’s strengths in this area. In any event, it makes a good example for “embarrassingly simple” parallelism in Haskell, the vector package and random number generation using the random-fu package.
The Ising model was (by Stigler’s law) proposed by Lenz in 1920 as a model for ferromagnetism, that is, the magnetism exhibited by bar magnets. The phenomenon is so named because it was first observed in iron (Latin ferrum, chemical symbol Fe). It is also exhibited, for example, by rare earths such as gadolinium. Ferromagnetic materials lose their magnetism at a critical temperature: the Curie temperature (named after Pierre, not his more famous wife). This is an example of a phase transition (ice melting into water is a more familiar example).
The Ising model (at least in 2 dimensions) predicts this phase transition and can also be used to describe phase transitions in alloys.
Abstracting the Ising model from its physical origins, one can think of it rather like Conway’s Game of Life: there is a grid and each cell on the grid is updated depending on the state of its neighbours. The difference with the Game of Life is that the updates are not deterministic but are random with the randomness selecting which cell gets updated as well as whether it gets updated. Thus we cannot update all the cells in parallel as would happen if we used repa. The reader only interested in this abstraction can go straight to the implementation (after finishing this introduction).
The diagram below shows a 2 dimensional grid of cells. Each cell can either be in an (spin) up state or (spin) down state as indicated by the arrows and corresponding colours. The Ising model then applies a parameterized set of rules by which the grid is updated. For certain parameters the cells remain in a random configuration, that is the net spin (taking up = 1 and down = -1) remains near zero; for other parameters, the spins in the cells line up (not entirely as there is always some randomness). It is this lining up that gives rise to ferromagnetism.
On the other hand, the physics and the Monte Carlo method used to simulate the model are of considerable interest in their own right. Readers interested in the Monte Carlo method can skip the physics and go to Monte Carlo Estimation. Readers interested in the physics can start with the section on Magnetism.
Definitions, theorems etc. are in bold and terminated by .
Following Ziman (Ziman 1964) and Reif (Reif 2009), we assume that each atom in the ferromagnetic material behaves like a small magnet. According to Hund’s rules, we would expect unpaired electrons in the d and f shells, for example in the transition elements and rare earths, and these would supply the magnetic moment. However, the magnetic interaction between atoms is far too small to account for ferromagnetism. For iron, for example, the Curie temperature is 1043K and the magnetic interaction between atoms cannot account for this by some margin. There is a quantum mechanical effect called the exchange interaction which is a consequence of the Pauli exclusion principle. Two electrons with the same spin on neighbouring atoms cannot get too close to each other. If the electrons have anti-parallel spins then the exclusion principle does not apply and there is no restriction on how close they can get to each other. Thus the electrostatic interaction between electrons on neighbouring atoms depends on whether the spins are parallel or anti-parallel. We can thus write the Hamiltonian in the form:
E = −J ∑_{⟨i, j⟩} σ_i σ_j − B μ ∑_i σ_i
Where
The notation ⟨i, j⟩ means we sum over all the nearest neighbours in the lattice;
B is the applied magnetic field (note we use E for the Hamiltonian); for all of this article we will assume this to be 0;
The range of each index is 1, …, N where N is the total number of atoms;
And the coupling constant J expresses the strength of the interaction between neighbouring spins and depends on the balance between the Pauli exclusion principle and the electrostatic interaction energy of the electrons; this may be positive, corresponding to parallel spins (ferromagnetism, which is the case we consider in this article), or negative, corresponding to anti-parallel spins (antiferromagnetism or ferrimagnetism, which we consider no further).
This post uses the random-fu package for random number generation and has also benefitted from comments by the author of that package (James Cook).
All diagrams were drawn using the Haskell diagrams domain specific language; the inhabitants of #diagrams were extremely helpful in helping create these.
Internet sources too numerous to mention were used for the physics and Monte Carlo. Some are listed in the bibliography. Apologies if you recognize something which does not get its just acknowledgement. The advantage of blog posts is that this can easily be remedied by leaving a comment.
Pragmas and imports to which only the over-enthusiastic reader need pay attention.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> module Ising (
> energy
> , magnetization
> , McState(..)
> , thinN
> , gridSize
> , nItt
> , expDv
> , tCrit
> , singleUpdate
> , randomUpdates
> , initState
> , multiUpdate
> , temps
> , initColdGrid
> , exampleGrid
> , notAperiodics
> , main
> ) where
We put all of our code for the diagrams in this blog post in a separate module to avoid clutter.
> import Diagrams
> import qualified Data.Vector.Unboxed as V
> import qualified Data.Vector.Unboxed.Mutable as M
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Control.Parallel.Strategies
> import Graphics.Rendering.Chart.Backend.Cairo
> import Data.Array.Repa hiding ( map, (++), zipWith )
> import Data.Array.Repa.Algorithms.Matrix
> import PrettyPrint ()
> gridSize :: Int
> gridSize = 20
>
> exampleGrid :: Int -> V.Vector Int
> exampleGrid gridSize = V.fromList $ f (gridSize * gridSize)
> where
> f m =
> evalState (replicateM m (sample (uniform (0 :: Int) (1 :: Int)) >>=
> \n -> return $ 2*n - 1))
> (pureMT 1)
> couplingConstant :: Double
> couplingConstant = 1.0
>
> kB :: Double
> kB = 1.0
>
> mu :: Double
> mu = 1.0
Statistical physics is an extremely large subject but there needs to be some justification for the use of the Boltzmann distribution. For an excellent introduction to the subject of Statistical Physics (in which the Boltzmann distribution plays a pivotal role) see David Tong’s lecture notes (Tong (2011)).
Suppose we have 3 boxes (we use the more modern nomenclature rather than the somewhat antiquated word urn) and 7 balls and we randomly assign the balls to boxes. Then it is far more likely that we will get an assignment of 2,2,3 rather than 0,0,7. When the numbers of boxes and balls become large (which is the case in statistical physics where we consider e.g. numbers of atoms) then it becomes very, very, very likely that the balls (or atoms) will spread themselves out uniformly as in the example.
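The “far more likely” claim can be made concrete: the number of ways of assigning 7 distinguishable balls to 3 boxes with a given occupancy is the multinomial coefficient 7! / (n₁! n₂! n₃!). A small sketch (ways is our own name):

```haskell
-- Multinomial coefficient: the number of ways of assigning
-- (sum ns) distinguishable balls to boxes with occupancies ns.
ways :: [Integer] -> Integer
ways ns = fact (sum ns) `div` product (map fact ns)
  where fact k = product [1 .. k]
```

ways [2,2,3] gives 210 while ways [0,0,7] gives 1, so the even spread is 210 times more likely than piling everything into one box.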
Now suppose we have N balls and M boxes and we associate an energy E_i with the i-th box but restrict the total energy to be constant. Thus we have two constraints:
The total number of balls N = n_1 + n_2 + … + n_M and
The total energy E = n_1 E_1 + n_2 E_2 + … + n_M E_M.
Let us assume the balls are allocated to boxes in the most likely way and let us denote the number in each box for this distribution as n̄_i.
Let us now move 2 balls from box , one to box and one to box . Note that both constraints are satisified by this move. We must therefore have
And let us start from the most likely distribution and move 1 ball from box and 1 ball from box into box . Again both constraints are satisified by this move. Doing a similar calculation to the one above we get
From this we deduce that as then or that for some constant .
Telescoping we can write . Thus the probability of a ball being in box is
If we now set and then we can re-write our distribution as
It should therefore be plausible that the probability of finding a system with a given energy in a given state is given by the Boltzmann distribution
where we have defined the temperature to be with being Boltzmann’s constant and is another normalizing constant.
Using the Boltzmann distribution we can calculate the average energy of the system
where it is understood that depends on .
If we let be the specific heat per particle (at constant volume) then
We know that
So we can write
so if we can estimate the energy and the square of the energy then we can estimate the specific heat.
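As a toy check of this relation, consider a two-level system with energies 0 and ε (with k_B = 1); a sketch under those assumptions (specificHeat is our own hypothetical name) computes ⟨E⟩ and ⟨E²⟩ directly from the Boltzmann distribution:

```haskell
-- Specific heat C = (<E^2> - <E>^2) / T^2 for a two-level system
-- with energies 0 and eps, taking kB = 1.
specificHeat :: Double -> Double -> Double
specificHeat eps t = (avg (\e -> e * e) - avg id ^ 2) / (t * t)
  where
    es    = [0, eps]
    ws    = [ exp (-e / t) | e <- es ]   -- Boltzmann weights
    z     = sum ws                       -- partition function
    avg f = sum (zipWith (\e w -> f e * w / z) es ws)
```

At ε = T = 1 this gives about 0.197, agreeing with the closed form ε²/T² · e^(ε/T) / (e^(ε/T) + 1)².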
Although Ising himself developed an analytic solution in 1 dimension and Onsager later did so in 2 dimensions, no-one has (yet) found an analytic solution for 3 dimensions.
One way to determine an approximate answer is to use a Monte Carlo method.
We could pick sample configurations at random according to the Boltzmann distribution
where T is the temperature, k_B is Boltzmann’s constant, E is the energy of a given state
and Z is a normalizing constant (Z for the German word Zustandssumme, “sum over states”)
We can evaluate the energy for one state easily enough as the Haskell below demonstrates. Note that we use so-called periodic boundary conditions which means our grid is actually a torus with no boundaries. In other words, we wrap the top of the grid on to the bottom and the left of the grid on to the right.
[As an aside, we represent each state by a Vector of Int. No doubt more efficient representations can be implemented. We also have to calculate offsets into the vector given a point’s grid co-ordinates.]
> energy :: (Fractional a, Integral b, M.Unbox b) => a -> V.Vector b -> a
> energy j v = -0.5 * j * (fromIntegral $ V.sum energyAux)
> where
>
> energyAux = V.generate l f
>
> l = V.length v
>
> f m = c * d
> where
> i = m `mod` gridSize
> j = (m `mod` (gridSize * gridSize)) `div` gridSize
>
> c = v V.! jc
> jc = gridSize * i + j
>
> d = n + e + s + w
>
> n = v V.! jn
> e = v V.! je
> s = v V.! js
> w = v V.! jw
>
> jn = gridSize * ((i + 1) `mod` gridSize) + j
> js = gridSize * ((i - 1) `mod` gridSize) + j
> je = gridSize * i + ((j + 1) `mod` gridSize)
> jw = gridSize * i + ((j - 1) `mod` gridSize)
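To see the formula in action without depending on the global gridSize, here is a self-contained sketch (energyOn is a hypothetical name, not part of the module above) computing the same quantity for an n × n configuration stored row-major in a list:

```haskell
-- E = -J/2 * sum_i s_i * (sum of s_i's four neighbours), with
-- periodic (toroidal) boundary conditions; the factor 1/2 corrects
-- for counting every bond twice.
energyOn :: Double -> Int -> [Int] -> Double
energyOn j n v = -0.5 * j * fromIntegral total
  where
    total    = sum [ s i k * nbrs i k | i <- [0 .. n - 1], k <- [0 .. n - 1] ]
    s i k    = v !! (n * (i `mod` n) + (k `mod` n))  -- mod wraps the torus
    nbrs i k = s (i + 1) k + s (i - 1) k + s i (k + 1) + s i (k - 1)
```

For a 2 × 2 grid with all spins up, every site sees four aligned neighbours, so the energy is −0.5 × 4 × 4 = −8.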
But what about the normalizing constant Z? Even for a modest grid size the number of states that needs to be summed over is extremely large: 2^(n²) for an n × n grid.
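The count grows doubly exponentially in the side length because each of the n² cells is independently up or down; a one-liner (numStates, our own name) makes the point:

```haskell
-- Each of the n^2 spins is up or down, so an n x n grid
-- has 2^(n^2) configurations.
numStates :: Integer -> Integer
numStates n = 2 ^ (n * n)
```

numStates 10 is already a 31-digit number, far beyond anything we could sum over directly.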
Instead of summing the entire state space, we could draw R random samples uniformly from the state space. We could then use
to estimate e.g. the magnetization
by
However we know from statistical physics that systems with large numbers of particles will occupy a small portion of the state space with any significant probability. And according to (MacKay 2003, chap. 29), a high dimensional distribution is often concentrated on a small region of the state space known as its typical set, whose volume is given by 2^H where H is the (Shannon) entropy of the (Boltzmann) distribution.
If almost all the probability mass is located in the typical set then the actual value of the (mean) magnetization will be determined by the values that the magnetization takes on that set. So uniform sampling will only give a good estimate if we make R large enough that we hit the typical set at least a small number of times. The total size of the state space is 2^N and the typical set has size 2^H, so a uniform sample hits the typical set with probability 2^(H−N). Thus we need roughly 2^(N−H) samples to hit it.
At high temperatures, the Boltzmann distribution flattens out so roughly all of the states have an equal likelihood of being occupied. We can calculate the (Shannon) entropy for this.
We can do a bit better than this. At high temperatures, and taking (as we have been assuming all along)
By definition
Thus
We can therefore re-write the partition function as
where the factor 2 is because we count the exchange interaction twice for each pair of atoms and the factor 4 is because each atom is assumed to only interact with 4 neighbours.
Since at high temperatures, by assumption, we have then also.
Thus
Calculating the free energy
From this we can determine the (Boltzmann) entropy
which agrees with our rather hand-wavy derivation of the (Shannon) entropy at high temperatures.
At low temperatures the story is quite different. We can calculate the ground state energy where all the spins are in the same direction.
And we can assume that at low temperatures flipped spins are isolated from other flipped spins. The energy for an atom in the ground state is -4J and thus the energy change if it flips is 8J. Thus the energy at low temperatures is
The partition function is
Again we can calculate the free energy and since we are doing this calculation to get a rough estimate of the entropy we can approximate further.
From this we can determine the (Boltzmann) entropy. Using the fact that and thus that
we have
The critical temperature (which we shall also obtain by simulation), given by Onsager's exact result in 2d, is
T_c = 2J / (k_B ln(1 + √2))
Taking J = 1 and k_B = 1:
> tCrit :: Double
> tCrit = 2.0 * j / log (1.0 + sqrt 2.0) where j = 1
ghci> tCrit
2.269185314213022
Plugging this in we get
ghci> exp (-8.0/tCrit) * (1.0 + 8.0 / tCrit)
0.1332181153896559
Thus the (Shannon) entropy is about 0.13N at the interesting temperature and about N at high temperatures. So uniform sampling would require only a handful of samples at high temperatures but roughly 2^(0.87N) at temperatures of interest. Even for our modest grid this is an astronomically large number of samples!
Fortunately, Metropolis and his team (Metropolis et al. 1953) discovered a way of constructing a Markov chain with a limiting distribution of the distribution required which does not require the evaluation of the partition function and which converges in a reasonable time (although theoretical results substantiating this latter point seem to be hard to come by).
Markov first studied the stochastic processes that came to be named after him in 1906.
We follow (Norris 1998), (Beichl and Sullivan 2000), (Diaconis and Saloff-Coste 1995), (Chib and Greenberg 1995) (Levin, Peres, and Wilmer 2008) and (Gravner 2011).
As usual we work on a probability measure space, that is, a triple (Ω, F, P). Although this may not be much in evidence, it is there lurking behind the scenes.
Let S be a finite set. In the case of an Ising model with N cells, this set will contain 2^N elements (all possible configurations).
A Markov chain is a discrete time stochastic process (X_0, X_1, X_2, …) such that
P(X_{n+1} = j | X_0 = i_0, X_1 = i_1, …, X_n = i) = P(X_{n+1} = j | X_n = i)
That is, where a Markov chain goes next depends only on where it is, not on its history.
A stochastic transition matrix is a matrix P such that P(i, j) ≥ 0 for all i and j and Σ_j P(i, j) = 1 for all i.
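Concretely, the two conditions are that every entry is non-negative and every row sums to one. A small check on a list-of-rows representation (a hypothetical helper, not part of the model code):

```haskell
-- A matrix is (row-)stochastic if all entries are non-negative
-- and every row sums to one (up to floating-point tolerance).
isStochastic :: [[Double]] -> Bool
isStochastic p = all (all (>= 0)) p
              && all (\row -> abs (sum row - 1) < 1e-12) p
```

For example, isStochastic [[0.5, 0.5], [0.25, 0.75]] holds, whereas a matrix with a row summing to 1.1 fails.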
We can describe a Markov chain by its transition matrix P and initial distribution λ. In this case we say the stochastic process (X_n) is Markov(λ, P).
We need to be able to discuss properties of Markov chains such as stationarity, irreducibility, recurrence and ergodicity.
A Markov chain has a stationary distribution π if πP = π, that is, if Σ_i π_i P(i, j) = π_j for all j.
One question one might ask is whether a given Markov chain has such a distribution. For example, for the unit (identity) transition matrix I, any distribution is a stationary distribution: μI = μ for any μ.
The n-th transition matrix of a Markov chain is P^n. The corresponding matrix entries are
P^n(i, j) = P(X_n = j | X_0 = i)
Another key question is, if there is a unique stationary distribution π, will the n-th transition probabilities converge to it, that is, does P^n(i, j) → π_j as n → ∞?
We say that i leads to j and write i → j if
P(X_n = j for some n ≥ 0 | X_0 = i) > 0
Theorem For distinct states i and j, the following are equivalent:
i → j;
P(i_0, i_1) P(i_1, i_2) ⋯ P(i_{n−1}, i_n) > 0 for some states i_0 = i, i_1, …, i_n = j;
P^n(i, j) > 0 for some n ≥ 0.
This makes it clear that the relation ↔ (where i ↔ j means i → j and j → i) is reflexive, symmetric and transitive, hence an equivalence relation. We can therefore partition the states into communicating classes. If there is only one class then the chain is called irreducible.
For example,
has more than one communicating class and so is not irreducible.
On the other hand
is irreducible.
Let (X_n) be a Markov chain. A state i is recurrent if
P(X_n = i for infinitely many n) = 1
The first passage time to state i is defined as
T_i = inf {n ≥ 1 : X_n = i}
Note that the infimum is taken over n greater than or equal to 1. Incidentally the first passage time is a stopping time, but any discussion of that would take this already long article even longer.
The expectation of the first passage time to i starting from i is denoted m_i = E_i(T_i).
Theorem Let P be irreducible. Then the following are equivalent:
Every state is positive recurrent.
Some state is positive recurrent.
P has an invariant distribution π, and in this case π_i = 1/m_i for all i.
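For a small chain the invariant distribution can be found numerically by repeatedly pushing any initial distribution through the transition matrix. A sketch using lists of rows (for the hypothetical two-state chain below the exact answer is (5/6, 1/6), which one can check by solving πP = π by hand):

```haskell
import Data.List (transpose)

-- One step of mu |-> mu P, with P given as a list of rows.
stepDist :: [[Double]] -> [Double] -> [Double]
stepDist p mu = [ sum (zipWith (*) mu col) | col <- transpose p ]

-- Iterate n times from an initial distribution.
stationary :: Int -> [[Double]] -> [Double] -> [Double]
stationary n p mu0 = iterate (stepDist p) mu0 !! n
```

For example, stationary 100 [[0.9, 0.1], [0.5, 0.5]] [1, 0] is very close to [5/6, 1/6] whatever the starting distribution.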
A state i is aperiodic if P^n(i, i) > 0 for all sufficiently large n.
For example, the chain with the transition matrix below is not aperiodic:
as running the following program segment shows, with the chain flip-flopping between the two states.
> notAperiodic0 :: Array U DIM2 Double
> notAperiodic0 = fromListUnboxed (Z :. (2 :: Int) :. (2 :: Int)) ([0,1,1,0] :: [Double])
> notAperiodics :: [Array U DIM2 Double]
> notAperiodics = scanl mmultS notAperiodic0 (replicate 4 notAperiodic0)
ghci> import Ising
ghci> import PrettyPrint
ghci> import Text.PrettyPrint.HughesPJClass
ghci> pPrint notAperiodics
[[0.0, 1.0]
[1.0, 0.0],
[1.0, 0.0]
[0.0, 1.0],
[0.0, 1.0]
[1.0, 0.0],
[1.0, 0.0]
[0.0, 1.0],
[0.0, 1.0]
[1.0, 0.0]]
Theorem Let P be irreducible and aperiodic and suppose that P has an invariant distribution π. Let λ be any distribution on the state space. Suppose that (X_n) is Markov(λ, P); then
P(X_n = j) → π_j as n → ∞ for all j
A proof of this theorem uses coupling developed by Doeblin; see (Gravner 2011) for more details.
Corollary With the conditions of the preceding Theorem,
P^n(i, j) → π_j as n → ∞ for all i and j
If the state space is infinite, the existence of a stationary distribution is not guaranteed even if the Markov chain is irreducible, see (Srikant 2009) for more details.
A stochastic matrix P and a distribution π are said to be in detailed balance if
π_i P(i, j) = π_j P(j, i) for all i and j
Theorem If a stochastic matrix P and a distribution π are in detailed balance then π is a stationary distribution.
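Detailed balance is easy to verify numerically for a small chain. For the hypothetical two-state example from above, π = (5/6, 1/6) and both sides of the condition equal 1/12:

```haskell
-- Check pi_i P(i,j) == pi_j P(j,i) for all pairs i, j,
-- up to floating-point tolerance.
inDetailedBalance :: [Double] -> [[Double]] -> Bool
inDetailedBalance piv p =
  and [ abs (piv !! i * p !! i !! j - piv !! j * p !! j !! i) < 1e-12
      | i <- [0 .. n - 1], j <- [0 .. n - 1] ]
  where n = length piv
```

For example, inDetailedBalance [5/6, 1/6] [[0.9, 0.1], [0.5, 0.5]] holds, whereas the uniform distribution on the same chain is not in detailed balance.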
Define the number of visits to strictly before time as
is the proportion of time before spent in state .
Theorem Let be an irreducible Markov chain then
If further the chain is positive recurrent then for any bounded function then
If further still the chain is aperiodic then it has a unique stationary distribution, which is also the limiting distribution
A Markov chain satisfying all three conditions is called ergodic.
Thus if we can find a Markov chain with the required stationary distribution and we sample a function of this chain we will get an estimate for the average value of the function. What Metropolis and his colleagues did was to provide a method of producing such a chain.
Algorithm Let π be a probability distribution on the state space S with π_i > 0 for all i, and let Q be an ergodic Markov chain on S with strictly positive transition probabilities q(i, j) > 0 (the latter condition is slightly stronger than it need be but we will not need fully general conditions).
Create a new (ergodic) Markov chain with transition probabilities
p(i, j) = q(i, j) (1 ∧ (π_j q(j, i)) / (π_i q(i, j))) for i ≠ j, with p(i, i) chosen so that each row sums to one,
where x ∧ y takes the minimum of its arguments.
Calculate the value of interest on the state space e.g. the total magnetization for each step produced by this new chain.
Repeat a sufficiently large number of times and take the average. This gives the estimate of the value of interest.
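The core of the algorithm, the accept/reject decision, can be sketched as a pure function. Note that the target need only be known up to the normalizing constant: it enters only through a ratio, so Z cancels, which is the whole point. The uniform draw r is supplied by the caller (this mirrors the way singleUpdate below is driven by pre-generated random numbers):

```haskell
-- One Metropolis accept/reject decision for a symmetric proposal.
-- piF is the (possibly unnormalized) target; r ~ Uniform(0,1).
metropolisStep :: (a -> Double) -> a -> a -> Double -> a
metropolisStep piF current proposal r
  | r < min 1 (piF proposal / piF current) = proposal
  | otherwise                              = current
```

An uphill move (one that increases piF) has acceptance ratio at least 1 and so is always taken; a downhill move is taken with probability equal to the ratio.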
Let us first note that the Markov chain produced by this algorithm almost trivially satisfies the detailed balance condition, for example,
Secondly, since we have specified that Q is ergodic, clearly the new chain is also ergodic (all the transition probabilities are strictly positive).
So we know the algorithm will converge to the unique distribution we specified to provide estimates of values of interest.
Two techniques that seem to be widespread in practical applications are burn in and thinning. Although neither have strong theoretical justification (“a thousand lemmings can’t be wrong”), we follow the practices in our implementation.
“Burn in” means run the chain for a certain number of iterations before sampling to allow it to forget the initial distribution.
“Thinning” means sampling the chain every n iterations rather than every iteration to reduce autocorrelation between samples.
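On a lazy list of chain states both techniques are one-liners (a sketch; the Ising code below instead folds thinning into the update step itself for efficiency):

```haskell
-- Drop the first nb states (burn in), then keep every n-th state
-- of the remainder (thinning). Works on infinite lists too.
burnAndThin :: Int -> Int -> [a] -> [a]
burnAndThin nb n = thin . drop nb
  where
    thin = map head . takeWhile (not . null) . iterate (drop n)
```

For example, burnAndThin 2 3 [1..12] is [3,6,9,12].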
Calculating the total magnetization is trivial; we just add up all the spins and multiply by the magnetic moment of the electron.
> magnetization :: (Num a, Integral b, M.Unbox b) => a -> V.Vector b -> a
> magnetization mu = (mu *) . fromIntegral . V.sum
We keep the state of the Monte Carlo simulation in a record.
> data McState = McState { mcMagnetization :: !Double
> , mcMAvg :: !Double
> , mcEnergy :: !Double
> , mcEAvg :: !Double
> , mcEAvg2 :: !Double
> , mcCount :: !Int
> , mcNumSamples :: !Int
> , mcGrid :: !(V.Vector Int)
> }
> deriving Show
As discussed above we sample every thinN iterations, a technique known as “thinning”.
> thinN :: Int
> thinN = 100
The total number of iterations per Monte Carlo run.
> nItt :: Int
> nItt = 1000000
There are only a very limited number of energy changes that can occur for each spin flip. Rather than recalculate the value of the Boltzmann distribution for every spin flip we can store these values in a Vector.
For example, if a spin is up and all four of its surrounding spins are up and it flips to down, then the energy change is 8J and the corresponding value of the Boltzmann distribution is exp(-8J/(k_B T)).
Another example: if the spin is up and exactly three of its neighbours are up, the local energy before is -2J and the energy after is 2J, so the energy change is 4J.
> expDv :: Double -> Double -> Double -> V.Vector Double
> expDv kB j t = V.generate 9 f
> where
> f n | odd n = 0.0
> f n = exp (j * ((fromIntegral (8 - n)) - 4.0) * 2.0 / (kB * t))
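The indexing convention deserves a check: the table is indexed by 4 + c·d, where c is the spin being flipped and d the sum of its four neighbours, so the entry at index 8 (spin aligned with all four neighbours) should be exp(-8J/(k_B T)). Reproducing the function standalone with a plain list for ease of checking:

```haskell
-- As expDv above, but returning a plain list; odd indices are
-- unreachable (4 + c*d is always even) and hold a dummy 0.
expDv' :: Double -> Double -> Double -> [Double]
expDv' kB j t = map f [0 .. 8]
  where
    f n | odd n     = 0.0
        | otherwise = exp (j * (fromIntegral (8 - n) - 4.0) * 2.0 / (kB * t))
```

At index 4 (c·d = 0, no energy change) the entry is exactly 1, and at index 8 it is exp(-8J/(k_B T)) as required.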
The most important function is the single step update of the Markov chain. We take an Int representing the amount of thinning we wish to perform, the vector of pre-calculated changes of the Boltzmann distribution, the current state, a value representing the randomly chosen co-ordinates of the grid element that will be updated and a value sampled from the uniform distribution which will decide whether the spin at the co-ordinates will be updated.
> singleUpdate :: Int -> V.Vector Double -> McState -> (Int, Int, Double) -> McState
> singleUpdate thinN expDvT u (i, j, r) =
> McState { mcMagnetization = magNew
> , mcMAvg = mcMAvgNew
> , mcEnergy = enNew
> , mcEAvg = mcEAvgNew
> , mcEAvg2 = mcEAvg2New
> , mcCount = mcCount u + 1
> , mcNumSamples = mcNumSamplesNew
> , mcGrid = gridNew
> }
> where
>
> (gridNew, magNew, enNew) =
> if p > r
> then ( V.modify (\v -> M.write v jc (-c)) v
> , magOld - fromIntegral (2 * c)
> , enOld + couplingConstant * fromIntegral (2 * c * d)
> )
> else (v, magOld, enOld)
>
> magOld = mcMagnetization u
> enOld = mcEnergy u
>
> (mcMAvgNew, mcEAvgNew, mcEAvg2New, mcNumSamplesNew) =
> if (mcCount u) `mod` thinN == 0
> then ( mcMAvgOld + magNew
> , mcEAvgOld + enNew
> , mcEAvg2Old + enNew2
> , mcNumSamplesOld + 1
> )
> else (mcMAvgOld, mcEAvgOld, mcEAvg2Old, mcNumSamplesOld)
>
> enNew2 = enNew * enNew
>
> mcMAvgOld = mcMAvg u
> mcEAvgOld = mcEAvg u
> mcEAvg2Old = mcEAvg2 u
> mcNumSamplesOld = mcNumSamples u
>
> v = mcGrid u
>
> p = expDvT V.! (4 + c * d)
>
> c = v V.! jc
> jc = gridSize * i + j
>
> d = n + e + s + w
>
> n = v V.! jn
> e = v V.! je
> s = v V.! js
> w = v V.! jw
>
> jn = gridSize * ((i + 1) `mod` gridSize) + j
> js = gridSize * ((i - 1) `mod` gridSize) + j
> je = gridSize * i + ((j + 1) `mod` gridSize)
> jw = gridSize * i + ((j - 1) `mod` gridSize)
In order to drive our Markov chain we need a supply of random positions and samples from the uniform distribution.
> randomUpdates :: Int -> V.Vector (Int, Int, Double)
> randomUpdates m =
> V.fromList $
> evalState (replicateM m x)
> (pureMT 1)
> where
> x = do r <- sample (uniform (0 :: Int) (gridSize - 1))
> c <- sample (uniform (0 :: Int) (gridSize - 1))
> v <- sample (uniform (0 :: Double) 1.0)
> return (r, c, v)
To get things going, we need an initial state. We start with a cold grid, that is, one with all spins pointing down.
> initColdGrid :: V.Vector Int
> initColdGrid = V.fromList $ replicate (gridSize * gridSize) (-1)
>
> initState :: McState
> initState = McState { mcMagnetization = magnetization mu initColdGrid
> , mcMAvg = 0.0
> , mcEnergy = energy couplingConstant initColdGrid
> , mcEAvg = 0.0
> , mcEAvg2 = 0.0
> , mcCount = 0
> , mcNumSamples = 0
> , mcGrid = initColdGrid
> }
We will want to run the simulation over a range of temperatures in order to determine where the phase transition occurs.
> temps :: [Double]
> temps = getTemps 4.0 0.5 100
> where
> getTemps :: Double -> Double -> Int -> [Double]
> getTemps h l n = [ m * x + c |
> w <- [1..n],
> let x = fromIntegral w ]
> where
> m = (h - l) / (fromIntegral n - 1)
> c = l - m
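getTemps h l n produces n evenly spaced temperatures running from l up to h inclusive; reproducing it standalone (slightly simplified) to check the endpoints:

```haskell
-- n evenly spaced values from l to h inclusive.
getTemps :: Double -> Double -> Int -> [Double]
getTemps h l n = [ m * fromIntegral w + c | w <- [1 .. n] ]
  where
    m = (h - l) / (fromIntegral n - 1)
    c = l - m
```

So getTemps 4.0 0.5 100 starts at 0.5 and ends at 4.0, straddling the critical temperature computed earlier.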
Now we can run a chain at a given temperature.
> multiUpdate :: McState -> Double -> V.Vector (Int, Int, Double) -> McState
> multiUpdate s t = V.foldl (singleUpdate thinN (expDv kB couplingConstant t)) s
For example, running the model at a high temperature for 100, 1000 and 10,000 steps respectively shows disorder growing.
On the other hand, running the model at a low temperature shows very limited disorder.
For any given state we can extract the magnetization, the energy and the square of the energy.
> magFn :: McState -> Double
> magFn s = abs (mcMAvg s / (fromIntegral $ mcNumSamples s))
>
> magFnNorm :: McState -> Double
> magFnNorm s = magFn s / (fromIntegral (gridSize * gridSize))
>
> enFn :: McState -> Double
> enFn s = mcEAvg s / (fromIntegral $ mcNumSamples s)
>
> enFnNorm :: McState -> Double
> enFnNorm s = enFn s / (fromIntegral (gridSize * gridSize))
> en2Fn :: McState -> Double
> en2Fn s = mcEAvg2 s / (fromIntegral $ mcNumSamples s)
And we can also calculate the mean square error.
> meanSqErr :: McState -> Double
> meanSqErr s = e2 - e*e
> where
> e = enFn s
> e2 = en2Fn s
>
> meanSqErrNorm :: McState -> Double
> meanSqErrNorm s = meanSqErr s / (fromIntegral (gridSize * gridSize))
Finally we can run our simulation in parallel using parMap rather than map (this is the one line change required to get parallelism).
> main :: IO ()
> main = do print "Start"
>
> let rs = parMap rpar f temps
> where
> f t = multiUpdate initState t (randomUpdates nItt)
>
> renderableToFile (FileOptions (500, 500) PNG)
> (errChart temps rs magFnNorm enFnNorm meanSqErrNorm)
> "diagrams/MagnetismAndEnergy.png"
>
> print "Done"
ghc -O2 Ising.lhs -threaded -o Ising -package-db=.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d/ -main-is Ising
time ./Ising +RTS -N1
"Start"
"Done"
real 0m14.879s
user 0m14.508s
sys 0m0.369s
time ./Ising +RTS -N2
"Start"
"Done"
real 0m8.269s
user 0m15.521s
sys 0m0.389s
time ./Ising +RTS -N4
"Start"
"Done"
real 0m5.444s
user 0m19.386s
sys 0m0.414s
The sources for this article can be downloaded here.
Beichl, Isabel, and Francis Sullivan. 2000. “The Metropolis Algorithm.” In Computing in Science and Engg., 2:65–69. 1. Piscataway, NJ, USA: IEEE Educational Activities Department. doi:10.1109/5992.814660.
Chib, Siddhartha, and Edward Greenberg. 1995. “Understanding the Metropolis-hastings Algorithm.” The American Statistician 49 (4) (November): 327–335. http://www.jstor.org/stable/2684568.
Diaconis, Persi, and Laurent Saloff-Coste. 1995. “What Do We Know About the Metropolis Algorithm?” In Proceedings of the Twenty-seventh Annual ACM Symposium on Theory of Computing, 112–129. STOC ’95. New York, NY, USA: ACM. doi:10.1145/225058.225095. http://doi.acm.org/10.1145/225058.225095.
Gravner, Janko. 2011. “MAT135A Probability.” https://www.math.ucdavis.edu/~gravner/MAT135A/resources/lecturenotes.pdf.
Levin, David A., Yuval Peres, and Elizabeth L. Wilmer. 2008. Markov Chains and Mixing Times. American Mathematical Society. http://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf.
MacKay, David J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press. http://www.cambridge.org/0521642981.
Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21: 1087–1092.
Norris, James R. 1998. Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Reif, F. 2009. Fundamentals of Statistical and Thermal Physics. McGraw-hill Series in Fundamentals of Physics. Waveland Press, Incorporated. http://books.google.co.uk/books?id=gYpBPgAACAAJ.
Srikant, R. 2009. “ECE534 Random Processes.” http://www.ifp.illinois.edu/~srikant/ECE534/Spring09/DTMC.pdf.
Tong, David. 2011. “Lectures on Statistical Physics.” http://www.damtp.cam.ac.uk/user/tong/statphys.html.
Ziman, J.M. 1964. Principles of the Theory of Solids. London, England: Cambridge University Press.
Apparently in the world of car parking, where Westminster leads, the rest of the UK follows. For example Westminster is rolling out individual parking bay monitors.
Our analysis gained an honourable mention. Ian has produced a great write-up of our analysis with fine watercolour maps and Bart’s time-lapse video of parking behaviour.
We mainly used Python, Pandas and Excel for the actual analysis and QGIS for the maps.
I thought it would be an interesting exercise to recreate some of the analysis in Haskell.
First some pragmas and imports.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
>
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE OverloadedStrings #-}
> {-# LANGUAGE ViewPatterns #-}
> {-# LANGUAGE DeriveTraversable #-}
> {-# LANGUAGE DeriveFoldable #-}
> {-# LANGUAGE DeriveFunctor #-}
>
> module WardsOfLondon ( parkingDia ) where
>
> import Database.Shapefile
>
> import Data.Binary.Get
> import qualified Data.ByteString.Lazy as BL
> import qualified Data.ByteString as B
> import Data.Binary.IEEE754
> import Data.Csv hiding ( decode, lookup )
> import Data.Csv.Streaming
> import qualified Data.Vector as V
> import Data.Time
> import qualified Data.Text as T
> import Data.Char
> import qualified Data.Map.Strict as Map
> import Data.Int( Int64 )
> import Data.Maybe ( fromJust, isJust )
> import Data.List ( unfoldr )
>
> import Control.Applicative
> import Control.Monad
>
> import Diagrams.Prelude
> import Diagrams.Backend.Cairo.CmdLine
>
> import System.FilePath
> import System.Directory
> import System.Locale
> import System.IO.Unsafe ( unsafePerformIO )
>
> import Data.Traversable ( Traversable )
> import qualified Data.Traversable as Tr
> import Data.Foldable ( Foldable )
A type synonym to make the types of some of our functions a bit more readable (and easier to modify, e.g. if we want to switch rendering backend).
> type Diag = Diagram Cairo R2
The paths to all our data.
SHP files are shape files, a fairly old but widespread map data format that was originally produced by a company called ESRI.
The polygons for the outline of the wards in Westminster. Surely there is a better place to get this rather than using tree canopy data.
The polyline data for all the roads (and other stuff) in the UK. We selected out all the roads in a bounding box for London. Even so plotting these takes about a minute.
The parking data were provided by Westminster Council. The set we consider below was about 4 million lines of cashless parking meter payments (about 1.3G).
> prefix :: FilePath
> prefix = "/Users/dom"
>
> dataDir :: FilePath
> dataDir = "Downloadable/DataScienceLondon"
>
> borough :: FilePath
> borough = "WestMinster"
>
> parkingBorough :: FilePath
> parkingBorough = "ParkingWestminster"
>
> flGL :: FilePath
> flGL = prefix </> dataDir </> "GreaterLondonRoads.shp"
>
> flParkingCashless :: FilePath
> flParkingCashless = "ParkingCashlessDenorm.csv"
The data for payments are contained in a CSV file so we create a record in which to keep the various fields contained therein.
> data Payment = Payment
> { _amountPaid :: LaxDouble
> , paidDurationMins :: Int
> , startDate :: UTCTime
> , _startDay :: DayOfTheWeek
> , _endDate :: UTCTime
> , _endDay :: DayOfTheWeek
> , _startTime :: TimeOfDay
> , _endTime :: TimeOfDay
> , _designationType :: T.Text
> , hoursOfControl :: T.Text
> , _tariff :: T.Text
> , _maxStay :: T.Text
> , spaces :: Maybe Int
> , _street :: T.Text
> , _xCoordinate :: Maybe Double
> , _yCoordinate :: Maybe Double
> , latitude :: Maybe Double
> , longitude :: Maybe LaxDouble
> }
> deriving Show
>
> data DayOfTheWeek = Monday
> | Tuesday
> | Wednesday
> | Thursday
> | Friday
> | Saturday
> | Sunday
> deriving (Read, Show, Enum)
We need to be able to parse the day of the week.
> instance FromField DayOfTheWeek where
> parseField s = read <$> parseField s
The field containing the longitude has values of the form -.1. The CSV parser for Double will reject this so we create our own datatype with a more relaxed parser.
> newtype LaxDouble = LaxDouble { laxDouble :: Double }
> deriving Show
>
> instance FromField LaxDouble where
> parseField = fmap LaxDouble . parseField . addLeading
>
> where
>
> addLeading :: B.ByteString -> B.ByteString
> addLeading bytes =
> case B.uncons bytes of
> Just (c -> '.', _) -> B.cons (o '0') bytes
> Just (c -> '-', rest) -> B.cons (o '-') (addLeading rest)
> _ -> bytes
>
> c = chr . fromIntegral
> o = fromIntegral . ord
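The transformation is easier to see on plain Strings: insert a leading zero before a bare decimal point, looking through a leading minus sign (a sketch of the ByteString version above):

```haskell
-- Prepend a zero when a number starts with '.', also after a '-'.
addLeading :: String -> String
addLeading s@('.' : _)  = '0' : s
addLeading ('-' : rest) = '-' : addLeading rest
addLeading s            = s
```

So "-.1" becomes "-0.1" and ".5" becomes "0.5", while well-formed numbers pass through unchanged.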
We need to be able to parse dates and times.
> instance FromField UTCTime where
> parseField s = do
> f <- parseField s
> case parseTime defaultTimeLocale "%F %X" f of
> Nothing -> fail "Unable to parse UTC time"
> Just g -> return g
>
> instance FromField TimeOfDay where
> parseField s = do
> f <- parseField s
> case parseTime defaultTimeLocale "%R" f of
> Nothing -> fail "Unable to parse time of day"
> Just g -> return g
Finally we can write a parser for our record.
> instance FromRecord Payment where
> parseRecord v
> | V.length v == 18
> = Payment <$>
> v .! 0 <*>
> v .! 1 <*>
> v .! 2 <*>
> v .! 3 <*>
> v .! 4 <*>
> v .! 5 <*>
> v .! 6 <*>
> v .! 7 <*>
> v .! 8 <*>
> v .! 9 <*>
> v .! 10 <*>
> v .! 11 <*>
> v .! 12 <*>
> v .! 13 <*>
> v .! 14 <*>
> v .! 15 <*>
> v .! 16 <*>
> v .! 17
> | otherwise = mzero
To make the analysis simpler, we only look at what might be a typical day, a Thursday in February.
> selectedDay :: UTCTime
> selectedDay = case parseTime defaultTimeLocale "%F %X" "2013-02-28 00:00:00" of
> Nothing -> error "Unable to parse UTC time"
> Just t -> t
It turns out that there are a very limited number of different sorts of hours of control so rather than parse this and calculate the number of control minutes per week, we can just create a simple look up table by hand.
> hoursOfControlTable :: [(T.Text, [Int])]
> hoursOfControlTable = [
> ("Mon - Fri 8.30am - 6.30pm" , [600, 600, 600, 600, 600, 0, 0])
> , ("Mon-Fri 10am - 4pm" , [360, 360, 360, 360, 360, 0, 0])
> , ("Mon - Fri 8.30-6.30 Sat 8.30 - 1.30" , [600, 600, 600, 600, 600, 300, 0])
> , ("Mon - Sat 8.30am - 6.30pm" , [600, 600, 600, 600, 600, 600, 0])
> , ("Mon-Sat 11am-6.30pm " , [450, 450, 450, 450, 450, 450, 0])
> , ("Mon - Fri 8.00pm - 8.00am" , [720, 720, 720, 720, 720, 0, 0])
> , ("Mon - Fri 8.30am - 6.30pm " , [600, 600, 600, 600, 600, 0, 0])
> , ("Mon - Fri 10.00am - 6.30pm\nSat 8.30am - 6.30pm", [510, 510, 510, 510, 510, 600, 0])
> , ("Mon-Sun 10.00am-4.00pm & 7.00pm - Midnight" , [660, 660, 660, 660, 660, 660, 660])
> ]
Now we create a record in which to record the statistics in which we are interested:
Number of times a lot is used.
Number of usage minutes. In reality this is the number of minutes purchased, and people often leave a bay before their ticket expires, so this is just a proxy.
The hours of control for the lot.
The number of bays in the lot.
N.B. The !’s are really important; otherwise we get a space leak. In more detail, these are strictness annotations which force the record fields to be evaluated rather than carried around unevaluated (taking up unnecessary space) until needed.
> data LotStats = LotStats { usageCount :: !Int
> , usageMins :: !Int64
> , usageControlTxt :: !T.Text
> , usageSpaces :: !(Maybe Int)
> }
> deriving Show
As we work our way through the data we need to update our statistics.
> updateStats :: LotStats -> LotStats -> LotStats
> updateStats s1 s2 = LotStats { usageCount = (usageCount s1) + (usageCount s2)
> , usageMins = (usageMins s1) + (usageMins s2)
> , usageControlTxt = usageControlTxt s2
> , usageSpaces = usageSpaces s2
> }
>
> initBayCountMap :: Map.Map (Pair Double) LotStats
> initBayCountMap = Map.empty
We are going to be working with co-ordinates which are pairs of numbers so we need a data type in which to keep them. Possibly overkill.
> data Pair a = Pair { xPair :: !a, yPair :: !a }
> deriving (Show, Eq, Ord, Functor, Foldable, Traversable)
Functions to get bounding boxes.
> getPair :: Get a -> Get (a,a)
> getPair getPart = do
> x <- getPart
> y <- getPart
> return (x,y)
>
> getBBox :: Get a -> Get (BBox a)
> getBBox getPoint = do
> bbMin <- getPoint
> bbMax <- getPoint
> return (BBox bbMin bbMax)
>
> bbox :: Get (BBox (Double, Double))
> bbox = do
> shpFileBBox <- getBBox (getPair getFloat64le)
> return shpFileBBox
>
> getBBs :: BL.ByteString -> BBox (Double, Double)
> getBBs = runGet $ do
> _ <- getShapeType32le
> bbox
>
> isInBB :: (Ord a, Ord b) => BBox (a, b) -> BBox (a, b) -> Bool
> isInBB bbx bby = ea >= eb && wa <= wb &&
> sa >= sb && na <= nb
> where
> (ea, sa) = bbMin bbx
> (wa, na) = bbMax bbx
> (eb, sb) = bbMin bby
> (wb, nb) = bbMax bby
>
> combineBBs :: (Ord a, Ord b) => BBox (a, b) -> BBox (a, b) -> BBox (a, b)
> combineBBs bbx bby = BBox { bbMin = (min ea eb, min sa sb)
> , bbMax = (max wa wb, max na nb)
> }
> where
> (ea, sa) = bbMin bbx
> (wa, na) = bbMax bbx
> (eb, sb) = bbMin bby
> (wb, nb) = bbMax bby
A function to get plotting information from the shape file.
> getRecs :: BL.ByteString ->
> [[(Double, Double)]]
> getRecs = runGet $ do
> _ <- getShapeType32le
> _ <- bbox
> nParts <- getWord32le
> nPoints <- getWord32le
> parts <- replicateM (fromIntegral nParts) getWord32le
> points <- replicateM (fromIntegral nPoints) (getPair getFloat64le)
> return (getParts (map fromIntegral parts) points)
>
> getParts :: [Int] -> [a] -> [[a]]
> getParts offsets ps = unfoldr g (gaps, ps)
> where
> gaps = zipWith (-) (tail offsets) offsets
> g ( [], []) = Nothing
> g ( [], xs) = Just (xs, ([], []))
> g (n:ns, xs) = Just (take n xs, (ns, drop n xs))
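For example, with part offsets [0,2,5] and seven points, the record splits into parts of sizes 2, 3 and 2, with the final part running to the end of the list (the function reproduced standalone so the example is self-contained):

```haskell
import Data.List (unfoldr)

-- Split a point list at the given part offsets.
getParts :: [Int] -> [a] -> [[a]]
getParts offsets ps = unfoldr g (gaps, ps)
  where
    gaps = zipWith (-) (tail offsets) offsets
    g ([],   []) = Nothing
    g ([],   xs) = Just (xs, ([], []))
    g (n:ns, xs) = Just (take n xs, (ns, drop n xs))
```

So getParts [0,2,5] [1..7] is [[1,2],[3,4,5],[6,7]].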
We need to be able to filter out e.g. roads that are not in a given bounding box.
> recsOfInterest :: BBox (Double, Double) -> [ShpRec] -> [ShpRec]
> recsOfInterest bb = filter (flip isInBB bb . getBBs . shpRecData)
A function to process each ward in Westminster.
> processWard :: [ShpRec] -> FilePath ->
> IO ([ShpRec], ([[(Double, Double)]], BBox (Double, Double)))
> processWard recDB fileName = do
> input <- BL.readFile $ prefix </> dataDir </> borough </> fileName
> let (hdr, recs) = runGet getShpFile input
> bb = shpFileBBox hdr
> let ps = head $ map getRecs (map shpRecData recs)
> return $ (recsOfInterest bb recDB, (ps, bb))
We want to draw roads and ward boundaries.
> colouredLine :: Double -> Colour Double -> [(Double, Double)] -> Diag
> colouredLine thickness lineColour xs = (fromVertices $ map p2 xs) #
> lw thickness #
> lc lineColour
And we want to draw parking lots with the hue varying according to how heavily they are utilised.
> bayDots :: [Pair Double] -> [Double] -> Diag
> bayDots xs bs = position (zip (map p2 $ map toPair xs) dots)
> where dots = map (\b -> circle 0.0005 # fcA (blend b c1 c2) # lw 0.0) bs
> toPair p = (xPair p, yPair p)
> c1 = darkgreen `withOpacity` 0.7
> c2 = lightgreen `withOpacity` 0.7
Update the statistics until we run out of data.
> processCsv :: Map.Map (Pair Double) LotStats ->
> Records Payment ->
> Map.Map (Pair Double) LotStats
> processCsv m rs = case rs of
> Cons u rest -> case u of
> Left err -> error err
> Right val -> case Tr.sequence $ Pair (laxDouble <$> longitude val) (latitude val) of
> Nothing -> processCsv m rest
> Just v -> if startDate val == selectedDay
> then processCsv (Map.insertWith updateStats v delta m) rest
> else processCsv m rest
> where
> delta = LotStats { usageCount = 1
> , usageMins = fromIntegral $ paidDurationMins val
> , usageControlTxt = hoursOfControl val
> , usageSpaces = spaces val
> }
> Nil mErr x -> if BL.null x
> then m
> else error $ "Nil: " ++ show mErr ++ " " ++ show x
> availableMinsThu :: LotStats -> Maybe Double
> availableMinsThu val =
> fmap fromIntegral $
> fmap (!!(fromEnum Thursday)) $
> flip lookup hoursOfControlTable $
> usageControlTxt val
Now for the main function.
> parkingDiaM :: IO Diag
> parkingDiaM = do
Read in the 4 million records lazily.
> parkingCashlessCsv <- BL.readFile $
> prefix </>
> dataDir </>
> parkingBorough </>
> flParkingCashless
Create our statistics.
> let bayCountMap = processCsv initBayCountMap (decode False parkingCashlessCsv)
>
> vals = Map.elems bayCountMap
Calculate the available minutes for each bay.
> availableMinsThus :: [Maybe Double]
> availableMinsThus = zipWith f (map availableMinsThu vals)
> (map (fmap fromIntegral . usageSpaces) vals)
> where
> f x y = (*) <$> x <*> y
Calculate the actual minutes used for each lot and the usage, which determines the hue of the dot representing the lot on the map.
> actualMinsThu :: [Double]
> actualMinsThu =
> map fromIntegral $
> map usageMins vals
>
> usage :: [Maybe Double]
> usage = zipWith f actualMinsThu availableMinsThus
> where
> f x y = (/) <$> pure x <*> y
We will need the co-ordinates of each lot in order to be able to plot it.
> let parkBayCoords :: [Pair Double]
> parkBayCoords = Map.keys bayCountMap
Get the ward shape files.
> fs <- getDirectoryContents $ prefix </> dataDir </> borough
> let wardShpFiles = map (uncurry addExtension) $
> filter ((==".shp"). snd) $
> map splitExtension fs
Get the London roads shape file.
> inputGL <- BL.readFile flGL
> let recsGL = snd $ runGet getShpFile inputGL
Get the data we wish to plot from each ward shape file.
> rps <- mapM (processWard recsGL) wardShpFiles
Get the roads inside the wards.
> let zs = map (getRecs . shpRecData) $ concat $ map fst rps
And create blue diagram elements for each road.
> ps :: [[Diag]]
> ps = map (map (colouredLine 0.0001 blue)) zs
Create diagram elements for each ward boundary.
> qs :: [[Diag]]
> qs = map (map (colouredLine 0.0003 navajowhite)) (map (fst. snd) rps)
Westminster is located at about 51 degrees North. We want to put a background colour on the map so either we need to move Westminster to be at the origin or create a background rectangle centred on Westminster. We do the former. We create a rectangle which is slightly bigger than the bounding box of Westminster. And we translate everything so that the South West corner of the bounding box of Westminster is the origin.
> let bbWestminster = foldr combineBBs (BBox (inf, inf) (negInf, negInf)) $
> map (snd . snd) rps
> where
> inf = read "Infinity"
> negInf = read "-Infinity"
>
> let (ea, sa) = bbMin bbWestminster
> (wa, na) = bbMax bbWestminster
> wmHeight = na - sa
> wmWidth = wa - ea
Create the background.
> wmBackground = translateX (wmWidth / 2.0) $
> translateY (wmHeight / 2.0) $
> scaleX 1.1 $
> scaleY 1.1 $
> rect wmWidth wmHeight # fcA (yellow `withOpacity` 0.1) # lw 0.0
Plot the streets.
> wmStreets = translateX (negate ea) $
> translateY (negate sa) $
> mconcat (mconcat ps)
Plot the parking lots.
> wmParking = translateX (negate ea) $
> translateY (negate sa) $
> uncurry bayDots $
> unzip $
> map (\(x, y) -> (x, fromJust y)) $
> filter (isJust . snd) $
> zip parkBayCoords usage
Plot the ward boundaries.
> wmWards = translateX (negate ea) $
> translateY (negate sa) $
> mconcat (mconcat qs)
>
> return $ wmBackground <>
> wmWards <>
> wmStreets <>
> wmParking
Sadly we have to use unsafePerformIO in order to be able to create the post using BlogLiteratelyD.
> parkingDia :: Diag
> parkingDia = unsafePerformIO parkingDiaM
And now we can see all the parking lots in Westminster as green dots. The darkness represents how heavily utilised they are. The thick gold lines delineate the wards in Westminster. In case it isn’t obvious the blue lines are the roads. The Thames, Hyde Park and Regent’s Park are fairly easy to spot. Less easy to spot but still fairly visible are Buckingham Palace and Green Park.
We appear to need to compile with ghc -O2; otherwise we get a space leak.
We didn’t explicitly need the equivalent of pandas. It would be interesting to go through the Haskell and Python code and see where we used pandas and what the equivalent was in Haskell.
Python and R seem more forgiving about data formats e.g. they handle -.1 where Haskell doesn’t. Perhaps this should be in the Haskell equivalent of pandas.
The intended audience of this article is someone who knows something about Machine Learning and Artificial Neural Networks (ANNs) in particular and who recalls that fitting an ANN required a technique called backpropagation. The goal of this post is to refresh the reader’s knowledge of ANNs and backpropagation and to show that the latter is merely a specialised version of automatic differentiation, a tool that all Machine Learning practitioners should know about and have in their toolkit.
The problem is simple to state: we have a (highly) non-linear function, the cost function of an Artificial Neural Network (ANN), and we wish to minimise this so as to estimate the parameters / weights of the function.
In order to minimise the function, one obvious approach is to use steepest descent: start with random values for the parameters to be estimated, find the direction in which the function decreases most quickly, step a small amount in that direction and repeat until close enough.
But we have two problems:
We have an algorithm or a computer program that calculates the non-linear function rather than the function itself.
The function has a very large number of parameters, hundreds if not thousands.
One thing we could try is bumping each parameter by a small amount to get partial derivatives numerically

$$ \frac{\partial E}{\partial w_i} \approx \frac{E(\ldots, w_i + \epsilon, \ldots) - E(\ldots, w_i, \ldots)}{\epsilon} $$
But this would mean evaluating our function many times and moreover we could easily get numerical errors as a result of the vagaries of floating point arithmetic.
As an alternative we could turn our algorithm or computer program into a function more recognisable as a mathematical function and then compute the differential itself as a function either by hand or by using a symbolic differentiation package. For the complicated expression which is our mathematical function, the former would be error prone and the latter could easily generate something which would be even more complex and costly to evaluate than the original expression.
The standard approach is to use a technique called backpropagation and the understanding and application of this technique forms a large part of many machine learning lecture courses.
Since at least the 1960s techniques for automatically differentiating computer programs have been discovered and re-discovered. Anyone who knows about these techniques and reads about backpropagation quickly realises that backpropagation is just automatic differentiation and steepest descent.
This article is divided into
Refresher on neural networks and backpropagation;
Methods for differentiation;
Backward and forward automatic differentiation and
Concluding thoughts.
The only thing important to remember throughout is the chain rule

$$ (g \circ f)'(a) = g'(f(a)) \cdot f'(a) $$

in alternative notation

$$ \frac{\mathrm{d}(g \circ f)}{\mathrm{d}x}(a) = \frac{\mathrm{d}g}{\mathrm{d}y}(f(a)) \frac{\mathrm{d}f}{\mathrm{d}x}(a) $$

where $y = f(x)$. More suggestively we can write

$$ \frac{\mathrm{d}g}{\mathrm{d}x} = \frac{\mathrm{d}g}{\mathrm{d}y} \frac{\mathrm{d}y}{\mathrm{d}x} $$

where it is understood that $\mathrm{d}g / \mathrm{d}x$ and $\mathrm{d}y / \mathrm{d}x$ are evaluated at $a$ and $\mathrm{d}g / \mathrm{d}y$ is evaluated at $f(a)$.

For example, with $g(y) = \sin y$ and $f(x) = x^2$ we obtain $\frac{\mathrm{d}}{\mathrm{d}x} \sin(x^2) = 2x \cos(x^2)$.
Sadly I cannot recall all the sources I looked at in order to produce this article but I have made heavy use of the following.
Here is our model, with $\boldsymbol{x}$ the input, $\hat{\boldsymbol{y}}$ the predicted output and $\boldsymbol{y}$ the actual output and $w^{(l)}$ the weights in the $l$-th layer. We have concretised the transfer function as $\tanh$ but it is quite popular to use the logistic function.

$$
\begin{aligned}
a_i^{(1)} &= \sum_{j=0}^{N^{(0)}} w_{ij}^{(1)} x_j \\
z_i^{(1)} &= \tanh(a_i^{(1)}) \\
a_i^{(l)} &= \sum_{j=0}^{N^{(l-1)}} w_{ij}^{(l)} z_j^{(l-1)} \\
z_i^{(l)} &= \tanh(a_i^{(l)}) \\
\hat{y}_i &= a_i^{(L)}
\end{aligned}
$$

with the loss or cost function

$$ E(\boldsymbol{w}; \boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|\hat{\boldsymbol{y}} - \boldsymbol{y}\|^2 = \frac{1}{2}\sum_i (\hat{y}_i - y_i)^2 $$

The diagram below depicts a neural network with a single hidden layer.
In order to apply the steepest descent algorithm we need to calculate the differentials of this latter function with respect to the weights, that is, we need to calculate

$$ \frac{\partial E}{\partial w_{ij}^{(l)}} $$

Applying the chain rule

$$ \frac{\partial E}{\partial w_{ij}^{(l)}} = \frac{\partial E}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} $$

Since

$$ a_i^{(l)} = \sum_{j=0}^{N^{(l-1)}} w_{ij}^{(l)} z_j^{(l-1)} $$

we have

$$ \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} = z_j^{(l-1)} $$

Defining

$$ \delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}} $$

we obtain

$$ \frac{\partial E}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} z_j^{(l-1)} $$

Finding the $z_j^{(l)}$ for each layer is straightforward: we start with the inputs and propagate forward. In order to find the $\delta_i^{(l)}$ we need to start with the outputs and propagate backwards:
For the output layer we have (since $\hat{y}_i = a_i^{(L)}$)

$$ \delta_i^{(L)} = \frac{\partial E}{\partial a_i^{(L)}} = \hat{y}_i - y_i $$

For a hidden layer using the chain rule

$$ \delta_i^{(l-1)} = \frac{\partial E}{\partial a_i^{(l-1)}} = \sum_j \frac{\partial E}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial a_i^{(l-1)}} $$

Now

$$ a_j^{(l)} = \sum_k w_{jk}^{(l)} z_k^{(l-1)} = \sum_k w_{jk}^{(l)} \tanh(a_k^{(l-1)}) $$

so that

$$ \frac{\partial a_j^{(l)}}{\partial a_i^{(l-1)}} = w_{ji}^{(l)} \tanh'(a_i^{(l-1)}) $$

and thus

$$ \delta_i^{(l-1)} = \tanh'(a_i^{(l-1)}) \sum_j w_{ji}^{(l)} \delta_j^{(l)} $$

Summarising
We calculate all $a_i^{(l)}$ and $z_i^{(l)}$ for each layer starting with the input layer and propagating forward.
We evaluate $\delta_i^{(L)}$ in the output layer using $\delta_i^{(L)} = \hat{y}_i - y_i$.
We evaluate $\delta_i^{(l)}$ in each layer using $\delta_i^{(l-1)} = \tanh'(a_i^{(l-1)}) \sum_j w_{ji}^{(l)} \delta_j^{(l)}$ starting with the output layer and propagating backwards.
Use $\partial E / \partial w_{ij}^{(l)} = \delta_i^{(l)} z_j^{(l-1)}$ to obtain the required derivatives in each layer.
For the particular transfer function $\tanh$ we have $\tanh'(x) = 1 - \tanh^2(x)$. And finally we can use the partial derivatives to step in the right direction using steepest descent

$$ w' = w - \gamma \nabla_{\!w} E $$

where $\gamma$ is the step size aka the learning rate.
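The four steps above can be sketched directly in Haskell for a toy single-hidden-layer network. This is a standalone illustration with invented names, not the code developed later in this post, and it makes one simplifying assumption of its own: the output layer also goes through $\tanh$, so its delta picks up an extra derivative factor.

```haskell
-- A toy single-hidden-layer network with tanh activations throughout
-- and loss E = (yHat - y)^2 / 2. Because the output here also goes
-- through tanh, the output delta carries an extra (1 - yHat^2) factor.

dot :: [Double] -> [Double] -> Double
dot us vs = sum (zipWith (*) us vs)

-- Step 1: forward sweep, recording the hidden activations.
forward :: [[Double]] -> [Double] -> [Double] -> ([Double], Double)
forward w1 w2 x = (zs, yHat)
  where
    zs   = map (tanh . (`dot` x)) w1  -- hidden activations
    yHat = tanh (dot w2 zs)           -- network output

-- Steps 2-4: backward sweep computing the deltas and the gradients.
backprop :: [[Double]] -> [Double] -> [Double] -> Double
         -> ([[Double]], [Double])
backprop w1 w2 x y = (gradW1, gradW2)
  where
    (zs, yHat) = forward w1 w2 x
    dOut   = (yHat - y) * (1 - yHat * yHat)  -- output-layer delta
    gradW2 = map (dOut *) zs
    -- hidden deltas, propagated backwards through w2
    dHid   = zipWith (\wj zj -> (1 - zj * zj) * wj * dOut) w2 zs
    gradW1 = map (\dj -> map (dj *) x) dHid

-- One steepest-descent step with learning rate gamma.
descend :: Double -> ([[Double]], [Double]) -> [Double] -> Double
        -> ([[Double]], [Double])
descend gamma (w1, w2) x y =
  (zipWith (zipWith upd) w1 g1, zipWith upd w2 g2)
  where
    (g1, g2) = backprop w1 w2 x y
    upd w g  = w - gamma * g
```

Iterating descend on a training point drives the cost down, which is all steepest descent asks of the gradients.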
So now we have an efficient algorithm for differentiating the cost function for an ANN and thus estimating its parameters but it seems quite complex. In the introduction we alluded to other methods of differentiation. Let us examine those in a bit more detail before moving on to a general technique for differentiating programs of which backpropagation turns out to be a specialisation.
Consider the function $f(x) = \sqrt{3 \sin x}$, then its differential is $f'(x) = \frac{3 \cos x}{2\sqrt{3 \sin x}}$ and we can easily compare a numerical approximation of this with the exact result. The numeric approximation is given by

$$ f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon} $$

In theory we should get a closer and closer approximation as $\epsilon$ decreases but as the chart below shows, at some point the approximation worsens as a result of the fact that we are using floating point arithmetic. For a complex function such as one which calculates the cost function of an ANN, there is a risk that we may end up getting a poor approximation for the derivative and thus a poor estimate for the parameters of the model.
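The effect is easy to reproduce in a few lines (a standalone sketch; the names here are invented):

```haskell
-- One-sided numerical differentiation of f x = sqrt (3 * sin x),
-- compared against the exact derivative at x = 2. The error first
-- shrinks as eps decreases, then grows again once floating point
-- rounding dominates the truncation error.

f :: Double -> Double
f x = sqrt (3 * sin x)

f' :: Double -> Double
f' x = 3 * cos x / (2 * sqrt (3 * sin x))

numDiff :: Double -> Double -> Double
numDiff eps x = (f (x + eps) - f x) / eps

-- absolute errors for eps = 1.0e-1 .. 1.0e-14
errs :: [Double]
errs = [ abs (numDiff (10 ** negate e) 2 - f' 2) | e <- [1 .. 14] ]
```

Printing errs shows the characteristic V shape: the error falls by orders of magnitude and then climbs back up for the smallest values of $\epsilon$.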
Suppose we have the following program (written in Python)
import numpy as np
def many_sines(x):
y = x
for i in range(1,7):
y = np.sin(x+y)
return y
When we unroll the loop we are actually evaluating

$$ y = \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + x)))))) $$
Now suppose we want to get the differential of this function. Symbolically, repeated application of the chain rule gives $y_n' = \cos(x + y_{n-1})(1 + y_{n-1}')$ with $y_0' = 1$, which when fully expanded is a long product of nested cosines.
Typically the non-linear function that an ANN gives is much more complex than the simple function given above. Thus its derivative will be correspondingly more complex and therefore expensive to compute. Moreover calculating this derivative by hand could easily introduce errors. And in order to have a computer perform the symbolic calculation we would have to encode our cost function somehow so that it is amenable to this form of manipulation.
Traditionally, forward mode is introduced first as this is considered easier to understand. We introduce reverse mode first as it can be seen to be a generalization of backpropagation.
Consider the function
Let us write this as a data flow graph.
We can thus re-write our function as a sequence of simpler functions in which each function only depends on variables earlier in the sequence.
In our particular example, since do not depend on
Further does not depend on so we also have
Now things become more interesting as and both depend on and so the chain rule makes an explicit appearance
Carrying on
Note that having worked from top to bottom (the forward sweep) in the graph to calculate the function itself, we have to work backwards from bottom to top (the backward sweep) to calculate the derivative.
So provided we can translate our program into a call graph, we can apply this procedure to calculate the differential with the same complexity as the original program.
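To make the two sweeps concrete, here is a hand-rolled example (my own function, not necessarily the one used above) for $f(x) = \sin(e^x + (e^x)^2)$:

```haskell
-- Reverse mode by hand: the forward sweep computes the intermediate
-- values u1, u2, u3 top to bottom; the backward sweep accumulates the
-- adjoints uBar_i = dy/du_i from the bottom of the graph back up to x.

reverseModeF :: Double -> (Double, Double)
reverseModeF x = (y, xBar)
  where
    -- forward sweep: each value depends only on earlier ones
    u1 = exp x
    u2 = u1 * u1
    u3 = u1 + u2
    y  = sin u3
    -- backward sweep
    u3Bar = cos u3                  -- y  = sin u3
    u2Bar = u3Bar                   -- u3 = u1 + u2
    u1Bar = u3Bar + 2 * u1 * u2Bar  -- u1 feeds both u2 and u3
    xBar  = u1Bar * u1              -- u1 = exp x, so du1/dx = exp x
```

Note that u1 feeds two downstream expressions, so its adjoint is the sum of two contributions; this is exactly where the chain rule makes its explicit appearance, and each intermediate value is visited once in each sweep.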
The pictorial representation of an ANN is effectively the data flow graph of the cost function (without the final cost calculation itself) and its differential can be calculated as just being identical to backpropagation.
An alternative method for automatic differentiation is called forward mode and has a simple implementation. Let us illustrate this using Haskell 98. The actual implementation is about 20 lines of code.
First some boilerplate declarations that need not concern us further.
> {-# LANGUAGE NoMonomorphismRestriction #-}
>
> module AD (
> Dual(..)
> , f
> , idD
> , cost
> , zs
> ) where
>
> default ()
Let us define dual numbers
> data Dual = Dual Double Double
> deriving (Eq, Show)
We can think of these pairs as first order polynomials in the indeterminate $\epsilon$, such that $\epsilon^2 = 0$.
Thus, for example, we have

$$ (x + x'\epsilon) + (y + y'\epsilon) = (x + y) + (x' + y')\epsilon $$

$$ (x + x'\epsilon)(y + y'\epsilon) = xy + (xy' + x'y)\epsilon $$

$$ \sin(x + x'\epsilon) = \sin x + x' \cos x \, \epsilon $$
Notice that these equations implicitly encode the chain rule. For example, we know, using the chain rule, that
And using the example equations above we have
Notice that dual numbers carry around the calculation and the derivative of the calculation. To actually evaluate at a particular value, say 2, we plug in 2 for and 1 for
Thus the derivative of at 2 is .
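Before wrapping any of this up in type classes, the product rule on bare pairs already shows the bookkeeping at work (a standalone sketch; mulD and cube are names invented here):

```haskell
-- The multiplication rule on bare pairs: (x, x') represents x + x'eps
-- with eps^2 = 0, i.e. a value carried together with its derivative.
mulD :: (Double, Double) -> (Double, Double) -> (Double, Double)
mulD (x, x') (y, y') = (x * y, x * y' + y * x')

-- The derivative of x^3 at 2: seed with (2, 1) and multiply three
-- times; the second component is 3 * 2^2 = 12.
cube :: (Double, Double)
cube = (2, 1) `mulD` (2, 1) `mulD` (2, 1)
```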
With a couple of helper functions we can implement this rule ($\epsilon^2 = 0$) by making Dual an instance of Num, Fractional and Floating.
> constD :: Double -> Dual
> constD x = Dual x 0
>
> idD :: Double -> Dual
> idD x = Dual x 1.0
Let us implement the rules above by declaring Dual to be an instance of Num. A Haskell class such as Num simply states that it is possible to perform a (usually) related collection of operations on any type which is declared as an instance of that class. For example, Integer and Double are both types which are instances on Num and thus one can add, multiply, etc. values of these types (but note one cannot add an Integer to a Double without first converting a value of the former to a value of the latter).
As an aside, we will never need the functions signum and abs and declare them as undefined; in a robust implementation we would specify an error if they were ever accidentally used.
> instance Num Dual where
> fromInteger n = constD $ fromInteger n
> (Dual x x') + (Dual y y') = Dual (x + y) (x' + y')
> (Dual x x') * (Dual y y') = Dual (x * y) (x * y' + y * x')
> negate (Dual x x') = Dual (negate x) (negate x')
> signum _ = undefined
> abs _ = undefined
We need to be able to perform division on Dual so we further declare it to be an instance of Fractional.
> instance Fractional Dual where
> fromRational p = constD $ fromRational p
> recip (Dual x x') = Dual (1.0 / x) (-x' / (x * x))
We want to be able to perform the same operations on Dual as we can on Float and Double. Thus we make Dual an instance of Floating which means we can now operate on values of this type as though, in some sense, they are the same as values of Float or Double (in Haskell 98 only instances for Float and Double are defined for the class Floating).
> instance Floating Dual where
> pi = constD pi
> exp (Dual x x') = Dual (exp x) (x' * exp x)
> log (Dual x x') = Dual (log x) (x' / x)
> sqrt (Dual x x') = Dual (sqrt x) (x' / (2 * sqrt x))
> sin (Dual x x') = Dual (sin x) (x' * cos x)
> cos (Dual x x') = Dual (cos x) (x' * (- sin x))
> sinh (Dual x x') = Dual (sinh x) (x' * cosh x)
> cosh (Dual x x') = Dual (cosh x) (x' * sinh x)
> asin (Dual x x') = Dual (asin x) (x' / sqrt (1 - x*x))
> acos (Dual x x') = Dual (acos x) (x' / (-sqrt (1 - x*x)))
> atan (Dual x x') = Dual (atan x) (x' / (1 + x*x))
> asinh (Dual x x') = Dual (asinh x) (x' / sqrt (1 + x*x))
> acosh (Dual x x') = Dual (acosh x) (x' / (sqrt (x*x - 1)))
> atanh (Dual x x') = Dual (atanh x) (x' / (1 - x*x))
That’s all we need to do. Let us implement the function we considered earlier.
> f = sqrt . (* 3) . sin
The compiler can infer its type
ghci> :t f
f :: Floating c => c -> c
We know the derivative of the function and can also implement it directly in Haskell.
> f' x = 3 * cos x / (2 * sqrt (3 * sin x))
Now we can evaluate the function along with its automatically calculated derivative and compare that with the derivative we calculated symbolically by hand.
ghci> f $ idD 2
Dual 1.6516332160855343 (-0.3779412091869595)
ghci> f' 2
-0.3779412091869595
To see that we are not doing symbolic differentiation (it’s easy to see we are not doing numerical differentiation) let us step through the actual evaluation.
In order not to make this blog post too long let us apply AD to finding parameters for a simple regression. The application to ANNs is described in a previous blog post. Note that in a real application we would use the Haskell ad package and furthermore use reverse mode AD as in this case it would be more efficient.
First our cost function
> cost m c xs ys = (/ (2 * (fromIntegral $ length xs))) $
> sum $
> zipWith errSq xs ys
> where
> errSq x y = z * z
> where
> z = y - (m * x + c)
ghci> :t cost
cost :: Fractional a => a -> a -> [a] -> [a] -> a
Some test data
> xs = [1,2,3,4,5,6,7,8,9,10]
> ys = [3,5,7,9,11,13,15,17,19,21]
and a learning rate
> gamma = 0.04
Now we create a function of the two parameters in our model by applying the cost function to the data. We need the (partial) derivatives of both the slope and the offset.
> g m c = cost m c xs ys
Now we can use our Dual numbers to calculate the required partial derivatives and update our estimates of the parameters. We create a stream of estimates.
> zs = (0.1, 0.1) : map f zs
> where
>
> deriv (Dual _ x') = x'
>
> f (c, m) = (c - gamma * cDeriv, m - gamma * mDeriv)
> where
> cDeriv = deriv $ g (constD m) $ idD c
> mDeriv = deriv $ flip g (constD c) $ idD m
And we can calculate the cost of each estimate to check our algorithm converges and then take the estimated parameters when the change in cost per iteration has reached an acceptable level.
ghci> take 2 $ drop 1000 $ map (\(c, m) -> cost m c xs ys) zs
[1.9088215184565296e-9,1.876891490619424e-9]
ghci> take 2 $ drop 1000 zs
[(0.9998665320141327,2.0000191714150106),(0.999867653022265,2.0000190103927853)]
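The same dual-number trick also differentiates the many_sines example from the section on numerical differentiation. The snippet below re-defines a minimal Dual so that it stands alone; manySines and manySines' are names invented here.

```haskell
-- A minimal Dual carrying a value and its derivative, with just
-- enough structure to run the many_sines example.
data Dual = Dual Double Double
  deriving (Eq, Show)

instance Num Dual where
  fromInteger n         = Dual (fromInteger n) 0
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x * y' + y * x')
  negate (Dual x x')    = Dual (negate x) (negate x')
  abs _                 = undefined
  signum _              = undefined

-- sin lifted to Dual; enough for this example without a full
-- Floating instance.
sinD :: Dual -> Dual
sinD (Dual x x') = Dual (sin x) (x' * cos x)

-- The Haskell analogue of the Python many_sines, abstracted over
-- which sine to use: y = sin (x + y), iterated six times.
manySines :: Num a => (a -> a) -> a -> a
manySines sine x = iterate (\y -> sine (x + y)) x !! 6

-- Its derivative, for free, by seeding with Dual x 1.
manySines' :: Double -> Double
manySines' x = x'
  where Dual _ x' = manySines sinD (Dual x 1)
```

The derivative falls out of a single evaluation of the program, with no unrolling and no symbolic manipulation.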
Perhaps AD is underused because of efficiency?
It seems that the Financial Services industry is aware that AD is more efficient than current practice albeit the technique is only slowly permeating. Order of magnitude improvements have been reported.
Smoking Adjoints: fast evaluation of Greeks in Monte Carlo Calculations
Adjoints and automatic (algorithmic) differentiation in computational finance
Perhaps AD is slowly permeating into Machine Learning as well but there seem to be no easy-to-find benchmarks.
If it were only possible to implement automatic differentiation in Haskell then its applicability would be somewhat limited. Fortunately this is not the case and it can be used in many languages.
In general, there are three different approaches:
Operator overloading: available for Haskell and C++. See the Haskell ad package and the C++ FADBAD approach using templates.
Source to source translators: available for Fortran, C and other languages e.g., ADIFOR, TAPENADE and see the wikipedia entry for a more comprehensive list.
New languages with built-in AD primitives. I have not listed any as it seems unlikely that anyone practicing Machine Learning would want to transfer their existing code to a research language. Maybe AD researchers could invest time in understanding what language feature improvements are needed to support AD natively in existing languages.
The planet Mercury has a highly elliptical orbit with a perihelion of about 0.31 AU and an aphelion of about 0.47 AU. This ellipse is not stationary but itself rotates about the Sun, a phenomenon known as the precession of the perihelion. A calculation carried out using Newtonian mechanics gives a value at variance with observation. The deficit is explained using General Relativity although we do not apply the relativistic correction in this article.
Just to give a flavour of the Haskell, we will have to calculate values of the infinite series of Legendre Polynomials evaluated at 0. We have

$$ P_{2n}(0) = (-1)^n \frac{(2n)!}{2^{2n} (n!)^2}, \qquad P_{2n+1}(0) = 0 $$

or equivalently the recurrence $P_{2n}(0) = -\frac{2n(2n-1)}{2^2 n^2} P_{2n-2}(0)$ with $P_0(0) = 1$.
Since we are dealing with infinite series we will want to define this co-recursively. We could use the Stream package but let us stay with lists.
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
>
> {-# LANGUAGE NoMonomorphismRestriction #-}
>
> module Legendre (
> legendre0s
> , main) where
>
> import Data.List
> import Text.Printf
> import Initial
>
> legendre0s :: [Rational]
> legendre0s = interleave legendre0Evens legendre0Odds
> where
> legendre0Evens = 1 : zipWith f [1..] legendre0Evens
> where f n p = negate $ p * (2 * n * (2 * n - 1)) / (2^2 * n^2)
> legendre0Odds = 0 : legendre0Odds
>
> interleave :: [a] -> [a] -> [a]
> interleave = curry $ unfoldr g
> where
> g ([], _) = Nothing
> g (x:xs, ys) = Just (x, (ys, xs))
And now we can calculate any number of terms we need
ghci> take 10 $ legendre0s
[1 % 1,0 % 1,(-1) % 2,0 % 1,3 % 8,0 % 1,(-5) % 16,0 % 1,35 % 128,0 % 1]
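As a sanity check, the same values fall out of Bonnet's recursion $(n+1)P_{n+1}(x) = (2n+1)xP_n(x) - nP_{n-1}(x)$ evaluated at $x = 0$ (a standalone sketch; the function name is invented):

```haskell
-- At x = 0 Bonnet's recursion collapses to
-- P_{n+1}(0) = -n/(n+1) * P_{n-1}(0), with P_0(0) = 1 and P_1(0) = 0.
legendre0sViaBonnet :: [Rational]
legendre0sViaBonnet = go 1 1 0
  where
    go :: Integer -> Rational -> Rational -> [Rational]
    go n pPrev p = pPrev : go (n + 1) p next
      where
        next = negate (fromIntegral n / fromIntegral (n + 1)) * pPrev
```

The first ten terms agree with the ghci output above.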
The reader wishing to skip the physics and the mathematical derivation can go straight to the section on implementation.
This article calculates the precession in Haskell using Newtonian methods. Over a long enough period, the gravitational effect of each outer planet on Mercury can be considered to be the same as a ring with the same mass as the planet; in other words we assume that the mass of each planet has been smeared out over its orbit. Probably one can model Saturn’s rings using this technique but that is certainly the subject of a different blog post.
More specifically, we model the mass of the ring as being totally concentrated on one particular value of $r = a$ and one particular value of $\theta = \pi/2$ with total mass $M$. Thus the density of our ring is

$$ \rho(r, \theta) = \frac{M}{2\pi a^2} \, \delta(r - a) \, \delta(\theta - \pi/2) $$

where $\delta$ is the Dirac delta function.
This blog follows the exposition given in [@Fitz:Newtonian:Dynamics] and [@brown:SpaceTime] concretized for the precession of the perihelion of Mercury with some of the elisions expanded. More details on Legendre Polynomials can be found in [@Bowles:Legendre:Polynomials].
We consider axially symmetric mass distributions in spherical polar co-ordinates $(r, \theta, \phi)$ where $r$ runs from $0$ to $\infty$, $\theta$ (the polar angle) runs from $0$ to $\pi$ and $\phi$ (the azimuthal angle) runs from $0$ to $2\pi$.
For clarity we give their conversion to cartesian co-ordinates.

$$ x = r \sin\theta \cos\phi, \quad y = r \sin\theta \sin\phi, \quad z = r \cos\theta $$

The volume element in spherical polar co-ordinates is given by $\mathrm{d}V = r^2 \sin\theta \, \mathrm{d}r \, \mathrm{d}\theta \, \mathrm{d}\phi$.
The gravitational potential given by $N$ masses each of mass $m_i$ and at position $\boldsymbol{r}_i$ is:

$$ \Phi(\boldsymbol{r}) = -G \sum_{i=1}^{N} \frac{m_i}{\|\boldsymbol{r} - \boldsymbol{r}_i\|} $$

If instead of point masses, we have a mass distribution $\rho(\boldsymbol{r})$ then

$$ \Phi(\boldsymbol{r}) = -G \int \frac{\rho(\boldsymbol{r}')}{\|\boldsymbol{r} - \boldsymbol{r}'\|} \, \mathrm{d}V' $$

where $\mathrm{d}V'$ is the volume element.
If the mass distribution is axially symmetric then so will the potential. In spherical polar co-ordinates:
where denotes the average over the azimuthal angle.
Expanding the middle term on the right hand size and noting that :
Writing and noting that
where are the Legendre Polynomials we see that when
Applying the Spherical Harmonic Addition Theorem (or see [@arfken]) we obtain
Similarly when we obtain
Substituting into the equation for the potential for axially symmetric mass distributions gives us
where
Note that the first integral has limits to and the second has limits to .
It is well known that the Legendre Polynomials form an orthogonal and complete set for continuous functions. Indeed

$$ \int_{-1}^{1} P_m(x) P_n(x) \, \mathrm{d}x = \frac{2}{2n + 1} \delta_{mn} $$
Thus we can write
Using the orthogonality condition we have
Hence
We now substitute in the axially symmetric density of a ring
Substituting again
Thus for
And for
Thus at and we have
and for
Let be the mass of the Sun then the potential due to all the Sun and all planets at a distance (excluding the planet positioned at ) is
An apsis is the closest or furthest point that a planet reaches, i.e. the perihelion or the aphelion. Without the perturbing influence of the outer planets the angle between these points, the apsidal angle, would be $\pi$. In the presence of the outer planets this is no longer the case.
Writing down the Lagrangian for a single planet we have
where is the total potential due to the Sun and the other planets (as calculated above). is ignorable so we have a conserved quantity . We write for which is also conserved.
Applying Lagrange’s equation for we have
Thus the radial equation of motion is
To make further progress let us take just one term for a ring outside the planet of consideration and use the trick given in [@brown:SpaceTime]. Writing for the aphelion, for the perihelion and for the major radius we have
Defining by writing
we have
giving
Using the Taylor approximation
Thus
Then since
We have
It is a nuisance to be continually writing . From now on this is denoted by . Using
We obtain
We can therefore re-write the radial equation of motion approximately as
Now let us re-write the equation of motion as a relation between and .
Thus we have
Letting we can re-write this as
This is the equation for simple harmonic motion with and since for a circular orbit we can write
and therefore the change in radians per revolution is
To convert this to arc-seconds per century we apply a conversion factor
where 414.9 is the number of orbits of Mercury per century.
The implementation is almost trivial given that we have previously calculated the Legendre Polynomials (evaluated at 0). First let us make the code a bit easier to read by defining arithmetic pointwise (note that for polynomials we would not want to do this).
> instance Num a => Num [a] where
> (*) = zipWith (*)
> (+) = zipWith (+)
> abs = error "abs makes no sense for infinite series"
> signum = error "signum makes no sense for infinite series"
> fromInteger = error "fromInteger makes no sense for infinite series"
Next we define our conversion function so that we can compare our results against those obtained by Le Verrier.
> conv :: Floating a => a -> a
> conv x = x * 414.9 * (360 / (2 * pi)) * 3600
The main calculation for which we can take any number of terms.
> perturbations :: Double -> Double -> Double -> Double -> [Double]
> perturbations mRing mSun planetR ringR =
> map ((pi * (mRing / mSun)) *) xs
> where
> xs = (map (^2) $ map fromRational legendre0s) *
> (map fromIntegral [0..]) *
> (map fromIntegral [1..]) *
> (map ((planetR / ringR)^) [1..])
Arbitrarily, let us take 20 terms.
> predict :: Double -> Double -> Double -> Double
> predict x y z = sum $
> map conv $
> take 20 $
> perturbations x sunMass y z
And now let us compare our calculations with Le Verrier’s.
> main :: IO ()
> main = do
>   printf "Venus   %3.1f %3.1f\n"
>          (280.6 :: Double)
>          (predict venusMass mercuryMajRad venusMajRad)
>   printf "Earth   %3.1f %3.1f\n"
>          (83.6 :: Double)
>          (predict earthMass mercuryMajRad earthMajRad)
>   printf "Mars    %3.1f %3.1f\n"
>          (2.6 :: Double)
>          (predict marsMass mercuryMajRad marsMajRad)
>   printf "Jupiter %3.1f %3.1f\n"
>          (152.6 :: Double)
>          (predict jupiterMass mercuryMajRad jupiterMajRad)
ghci> main
Venus 280.6 286.0
Earth 83.6 95.3
Mars 2.6 2.4
Jupiter 152.6 160.1
Not too bad.
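As a self-contained cross-check, the whole calculation for Venus fits in a few lines. The mass ratio and semi-major axes below are rough published values that I have substituted for the constants in the Initial module, so the figure only approximately reproduces the 286.0 above.

```haskell
import Data.List (zipWith4)

-- Legendre polynomials evaluated at zero, as Rationals (the same
-- recurrence as in the body of the post).
legendre0s :: [Rational]
legendre0s = concat (zipWith (\e o -> [e, o]) evens odds)
  where
    evens = 1 : zipWith f [1 ..] evens
      where f n p = negate $ p * (2 * n * (2 * n - 1)) / (2 ^ 2 * n ^ 2)
    odds  = 0 : odds

-- Precession of Mercury's perihelion due to Venus, in arc-seconds per
-- century, using assumed round-number constants.
venusPrecession :: Double
venusPrecession = conv $ sum $ take 20 terms
  where
    conv x    = x * 414.9 * (360 / (2 * pi)) * 3600
    massRatio = 2.448e-6          -- assumed Venus / Sun mass ratio
    r         = 0.3871 / 0.7233   -- assumed semi-major axes in AU
    terms     = zipWith4 (\p n m q -> pi * massRatio * p * p * n * m * q)
                         (map fromRational legendre0s)
                         [0 ..]
                         [1 ..]
                         (map (r ^) [1 ..])
```

With these inputs the sum comes out at roughly 286 arc-seconds per century, in line with the table above.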
Note the lectures by Fitzpatrick [@Fitz:Newtonian:Dynamics] use a different approximation for the apsidal angle
We do not derive this here but note that the expansion and approximation are not entirely straightforward and are given here for completeness. Note that the result derived this way is identical to the result obtained in the main body of the article.
The radial force is given by
We also have
Thus
Re-arranging
we note that the last two terms can be re-written with a numerator of
and a denominator which is dominated by the . Thus
Since this term is we can expand the term of interest further
Arfken, George. 1985. Mathematical Methods for Physicists. Third edition. Orlando: Academic Press.
Bowles, Robert. “Properties of Legendre Polynomials.” http://www.ucl.ac.uk/~ucahdrb/MATHM242/OutlineCD2.pdf.
Brown, Kevin. 2013. Physics in space and time. lulu.com.
Fitzpatrick, Richard. 1996. “Newtonian Dynamics.” http://farside.ph.utexas.edu/teaching/336k/lectures.