Cartography in Haskell

Introduction

This post started off life as a blog post on how I use nix but somehow transformed itself into a “how I do data visualisation” post. The nix is still here, though, quietly doing its work in the background.

Suppose you want to analyse your local election results and visualise them as a choropleth, but you’d like to use Haskell. You could try the shapefile package, but you will have to do a lot of the heavy lifting yourself (see here for an extended example). On the other hand, you know folks using R can produce such things in a few lines of code, and you also know from experience that unless you are very disciplined, large R scripts can quickly become unmaintainable. Can you get the best of both worlds? It seems you can, if you use inline-r. But now you have a bit of a configuration problem: how do you make sure that all the Haskell packages and all the R packages you need are available, and how can you make sure that your results are reproducible? My answer is to use nix.

Acknowledgements

I’d like to thank all my colleagues at Tweag who have had the patience to indulge my questions, especially zimbatm, and also all the folks on the #nixos IRC channel.

Downloading and Building

The sources are on github.

You’ll have to build my Include pandoc filter somehow and then run:

pandoc -s HowIUseNix.lhs --filter=./Include -t markdown+lhs > HowIUseNixExpanded.lhs
BlogLiterately HowIUseNixExpanded.lhs > HowIUseNixExpanded.html

Analysing the Data

Before looking at the .nix file I use, let’s actually analyse the data. First, some preamble.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# LANGUAGE TemplateHaskell   #-}
> {-# LANGUAGE QuasiQuotes       #-}
> {-# LANGUAGE DataKinds         #-}
> {-# LANGUAGE FlexibleContexts  #-}
> {-# LANGUAGE TypeOperators     #-}
> {-# LANGUAGE OverloadedStrings #-}
> import Prelude as P
> import qualified Language.R as R
> import Language.R.QQ
> import System.Directory
> import System.FilePath
> import Control.Monad
> import Data.List (stripPrefix, nub)
> import qualified Data.Foldable as Foldable
> import Frames hiding ( (:&) )
> import Control.Lens
> import qualified Data.FuzzySet as Fuzzy
> import Data.Text (pack, unpack)
> import qualified Control.Foldl as Foldl
> import Pipes.Prelude (fold)

We can scrape the data for the Kingston upon Thames local elections and put it in a CSV file. Using Frames, the Haskell equivalent of data frames / pandas, we can “read” in the data. I say “read” in quotes because if this were a big data set (it’s not), it might not fit in memory and we’d have to process it as a stream. I’ve included the sheet in the github repo so you can try reproducing my results.

> tableTypes "ElectionResults" "data/Election Results 2018 v1.csv"
> loadElectionResults :: IO (Frame ElectionResults)
> loadElectionResults = inCoreAoS (readTable "data/Election Results 2018 v1.csv")
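
As an aside, tableTypes is Template Haskell: at compile time it inspects the CSV and generates a record type (here ElectionResults) together with a lens for each column. Schematically, for the three columns used below, the generated declarations behave roughly like the following (illustrative only — the real names and types are derived from the CSV header):

declareColumn "ward"  ''Text
declareColumn "party" ''Text
declareColumn "votes" ''Int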

We could analyse the data set in order to find the ward (aka electoral district) names but I happen to know them.

> wardNames :: [Text]
> wardNames = [ "Berrylands"
>             , "Norbiton"
>             , "Old Malden"
>             , "Coombe Hill"
>             , "Beverley"
>             , "Chessington North & Hook"
>             , "Tolworth and Hook Rise"
>             , "Coombe Vale"
>             , "St James"
>             , "Tudor"
>             , "Alexandra"
>             , "Canbury"
>             , "Surbiton Hill"
>             , "Chessington South"
>             , "Grove"
>             , "St Marks"
>             ]

We need a query to get both the total number of votes cast in the ward and the number of votes cast for the party which now controls the borough. We can create two filters to run in parallel, summing as we go.

> getPartyWard :: Text -> Text -> Foldl.Fold ElectionResults (Int, Int)
> getPartyWard p w =
>   (,) <$> (Foldl.handles (filtered (\x -> x ^. ward == w)) .
>            Foldl.handles votes) Foldl.sum
>       <*> (Foldl.handles (filtered (\x -> x ^. ward == w &&
>                                           x ^. party == p)) .
>            Foldl.handles votes) Foldl.sum

We can combine all the filters and sums into one big query where they all run in parallel.

> getVotes :: Foldl.Fold ElectionResults [(Int, Int)]
> getVotes = sequenceA $ map (getPartyWard "Liberal Democrat") wardNames
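
Because Fold has an Applicative instance, sequenceA combines a list of folds into a single fold that computes all of them in one traversal of the data. A quick sanity check in ghci, independent of the election data:

ghci> Foldl.fold (sequenceA [Foldl.sum, Foldl.product]) [1, 2, 3, 4]
  [10,24]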

I downloaded the geographical data from here but I imagine there are many other sources.

First we find all the shapefiles in the directory we downloaded.

> root :: FilePath
> root = "data/KingstonuponThames"
> main :: IO ()
> main = do
>   ns <- runSafeT $ Foldl.purely fold getVotes (readTable "data/Election Results 2018 v1.csv")
>   let ps :: [Double]
>       ps = map (\(x, y) -> 100.0 * fromIntegral y / fromIntegral x) ns
>   let percentLD = zip (map unpack wardNames) ps
> 
>   ds <- getDirectoryContents root
>   let es = filter (\x -> takeExtension x == ".shp") ds
>   let fs = case xs of
>              Nothing -> error "Ward prefix does not match"
>              Just ys -> ys
>            where
>              xs = sequence $
>                   map (fmap dropExtension) $
>                   map (stripPrefix "KingstonuponThames_") es

Sadly, when we get the ward names from the file names they don’t match the ward names in the electoral data sheet. Fortunately, fuzzy matching resolves this problem.

>   rs <- loadElectionResults
>   let wardSet :: Fuzzy.FuzzySet
>       wardSet = Fuzzy.fromList . nub . Foldable.toList . fmap (^. ward) $ rs
> 
>   let gs = case sequence $ map (Fuzzy.getOne wardSet) $ map pack fs of
>              Nothing -> error "Ward name and ward filename do not match"
>              Just xs -> xs

Rather than making the colour of each ward vary continuously depending on the percentage share of the vote, let’s create 4 buckets to make it easier to see at a glance how the election results played out.

>   let bucket :: Double -> String
>       bucket x | x < 40    = "whitesmoke"
>                | x < 50    = "yellow1"
>                | x < 60    = "yellow2"
>                | otherwise = "yellow3"
> 
>   let cs :: [String]
>       cs = case sequence $ map (\w -> lookup w percentLD) (map unpack gs) of
>              Nothing -> error "Wrong name in table"
>              Just xs -> map bucket xs

Now we move into the world of R, courtesy of inline-r. First we load up the required libraries. Judging by URLs like this, it is not always easy to install the R binding to GDAL (Geospatial Data Abstraction Library). But with nix, we don’t even have to use install.packages(...).

>   R.runRegion $ do
>     [r| library(rgdal)   |]
>     [r| library(ggplot2) |]
>     [r| library(dplyr)   |]

Next we create an empty map (I have no idea what this would look like if it were plotted).

>     map0 <- [r| ggplot() |]

And now we can fold over the shape data for each ward and its colour to build the basic map we want.

>     mapN <- foldM
>       (\m (f, c) -> do
>           let wward = root </> f
>           shapefileN <- [r| readOGR(wward_hs) |]
>           shapefileN_df <- [r| fortify(shapefileN_hs) |]
>           withColours <- [r| mutate(shapefileN_df_hs, colour=rep(c_hs, nrow(shapefileN_df_hs))) |]
>           [r| m_hs +
>               geom_polygon(data = withColours_hs,
>                            aes(x = long, y = lat, group = group, fill=colour),
>                            color = 'gray', size = .2) |]) map0 (zip es cs)

Now we can annotate the map with all the details needed to make it intelligible.

>     map_projected <- [r| mapN_hs + coord_map() +
>                          ggtitle("Kingston on Thames 2018 Local Election\nLib Dem Percentage Vote by Ward") +
>                          theme(legend.position="bottom", plot.title = element_text(lineheight=.8, face="bold", hjust=0.5)) +
>                          scale_fill_manual(values=c("whitesmoke", "yellow1", "yellow2", "yellow3"),
>                                            name="Vote (%)",
>                                            labels=c("0%-40% Lib Dem", "40%-50% Lib Dem", "50%-60% Lib Dem", "60%-100% Lib Dem")) |]

And finally save it in a file.

>     [r| jpeg(filename="diagrams/kingston.jpg") |]
>     [r| print(map_projected_hs)               |]
>     [r| dev.off()                             |]
>     return ()

How I Really Use nix

I haven’t found a way of interspersing text with nix code so here are some comments.

  1. I use neither all the Haskell packages listed nor all the R packages. I tend to put a shell.nix file in the directory where I am working and then nix-shell shell.nix -I nixpkgs=~/nixpkgs as at the moment I seem to need packages which were recently added to nixpkgs.

  2. I couldn’t get the tests for inline-r to run on macOS so I added dontCheck.

  3. Some Haskell packages in nixpkgs have their bounds set too strictly so I have used doJailbreak.

{ pkgs ? import <nixpkgs> {}, compiler ? "ghc822", doBenchmark ? false }:

let

f = { mkDerivation, haskell, base, foldl, Frames, fuzzyset
    , inline-r, integration, lens
    , pandoc-types, plots
    , diagrams-rasterific
    , diagrams
    , diagrams-svg
    , diagrams-contrib
    , R, random, stdenv
    , template-haskell, temporary }:
mkDerivation {
  pname = "mrp";
  version = "1.0.0";
  src = ./.;
  isLibrary = false;
  isExecutable = true;
  executableHaskellDepends = [
    base
    diagrams
    diagrams-rasterific
    diagrams-svg
    diagrams-contrib
    (haskell.lib.dontCheck inline-r)
    foldl
    Frames
    fuzzyset
    integration
    lens
    pandoc-types
    plots
    random
    template-haskell
    temporary
  ];
  executableSystemDepends = [
    R
    pkgs.rPackages.dplyr
    pkgs.rPackages.ggmap
    pkgs.rPackages.ggplot2
    pkgs.rPackages.knitr
    pkgs.rPackages.maptools
    pkgs.rPackages.reshape2
    pkgs.rPackages.rgeos
    pkgs.rPackages.rgdal
    pkgs.rPackages.rstan
    pkgs.rPackages.zoo];
  license = stdenv.lib.licenses.bsd3;
};

haskellPackages = if compiler == "default"
  then pkgs.haskellPackages
  else pkgs.haskell.packages.${compiler};

myHaskellPackages = haskellPackages.override {
  overrides = self: super: with pkgs.haskell.lib; {
    diagrams-rasterific = doJailbreak super.diagrams-rasterific;
    diagrams-svg = doJailbreak super.diagrams-svg;
    diagrams-contrib = doJailbreak super.diagrams-contrib;
    diagrams = doJailbreak super.diagrams;
    inline-r = dontCheck super.inline-r;
    pipes-group = doJailbreak super.pipes-group;
  };
};

variant = if doBenchmark then pkgs.haskell.lib.doBenchmark else pkgs.lib.id;

drv = variant (myHaskellPackages.callPackage f {});

in

  if pkgs.lib.inNixShell then drv.env else drv

Reproducibility and Old Faithful

Introduction

For the blog post still being written on variational methods, I referred to the still excellent Bishop (2006), who uses as his example the data available in R for the geyser in Yellowstone National Park called “Old Faithful”.

While I was explaining this to another statistician, they started to ask about the dataset. Since I couldn’t immediately answer their questions (when was it collected? over how long? how was it collected?), I started to look at the dataset more closely. The more I dug into where the data came from, the less certain I became that the dataset reflects how Old Faithful actually behaves.

Here’s what I found. On the way, I use Haskell, R embedded in Haskell, data frames in Haskell and nix.

The Investigation

First some imports and language extensions.

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-unused-top-binds #-}
> 
> {-# LANGUAGE QuasiQuotes          #-}
> {-# LANGUAGE ScopedTypeVariables  #-}
> {-# LANGUAGE DataKinds            #-}
> {-# LANGUAGE FlexibleContexts     #-}
> {-# LANGUAGE TemplateHaskell      #-}
> {-# LANGUAGE TypeOperators        #-}
> {-# LANGUAGE TypeSynonymInstances #-}
> {-# LANGUAGE FlexibleInstances    #-}
> {-# LANGUAGE UndecidableInstances #-}
> {-# LANGUAGE ExplicitForAll       #-}
> {-# LANGUAGE OverloadedStrings    #-}
> {-# LANGUAGE GADTs                #-}
> {-# LANGUAGE TypeApplications     #-}
> 
> 
> module Main (main) where
> 
> import Prelude as P
> 
> import Control.Lens
> 
> import Data.Foldable
> import Frames hiding ( (:&) )
> 
> import Data.Vinyl (Rec(..))
> 
> import qualified Language.R as R
> import Language.R (R)
> import Language.R.QQ
> import Data.Int
> 
> import Plots as P
> import qualified Diagrams.Prelude as D
> import Diagrams.Backend.Rasterific
> 
> import Data.Time.Clock(UTCTime, NominalDiffTime, diffUTCTime)
> import Data.Time.Clock.POSIX(posixSecondsToUTCTime)
> 
> import Data.Attoparsec.Text
> import Control.Applicative

R provides many datasets presumably for ease of use. We can access the “Old Faithful” data set in Haskell using some very simple inline R.

> eRs :: R s [Double]
> eRs = R.dynSEXP <$> [r| faithful$eruptions |]
> 
> wRs :: R s [Int32]
> wRs = R.dynSEXP <$> [r| faithful$waiting |]

And then plot the dataset using

> kSaxis :: [(Double, Double)] -> P.Axis B D.V2 Double
> kSaxis xs = P.r2Axis &~ do
>   P.scatterPlot' xs
> 
> plotOF :: IO ()
> plotOF = do
>   es' <- R.runRegion $ do eRs
>   ws' <- R.runRegion $ do wRs
>   renderRasterific "diagrams/Test.png"
>                    (D.dims2D 500.0 500.0)
>                    (renderAxis $ kSaxis $ P.zip es' (P.map fromIntegral ws'))

In the documentation for this dataset, the source is given as Härdle (2012). There the data are given in Table 3 on page 201. It contains 270 observations, not 272 as the R dataset does! Note that there seems to be a correlation between inter-eruption time and duration of eruption, and eruptions seem to fall into one of two groups.

The documentation also says “See Also geyser in package MASS for the Azzalini–Bowman version”. So let’s look at this and plot it.

> eAltRs :: R s [Double]
> eAltRs = R.dynSEXP <$> [r| geyser$duration |]
> 
> wAltRs :: R s [Int32]
> wAltRs = R.dynSEXP <$> [r| geyser$waiting |]
> 
> plotOF' :: IO ()
> plotOF' = do
>   es <- R.runRegion $ do _ <- [r| library(MASS) |]
>                          eAltRs
>   ws <- R.runRegion $ do wAltRs
>   renderRasterific "diagrams/TestR.png"
>                    (D.dims2D 500.0 500.0)
>                    (renderAxis $ kSaxis $ P.zip es (P.map fromIntegral ws))

A quick look at the first few entries suggests this is the data in Table 1 of Azzalini and Bowman (1990) but NB I didn’t check each item.

It looks quite different to the faithful dataset. But to answer my fellow statistician’s questions: “This analysis deals with data which were collected continuously from August 1st until August 15th, 1985”.

It turns out there is a website which collects data for a large number of geysers and has historical datasets for them, including Old Faithful. The data are in a slightly awkward format and it is not clear (to me at any rate) what some of the fields mean.

Let’s examine the first 1000 observations in the data set using the Haskell package Frames.

First we ask Frames to automatically type the data and read it into a frame.

> tableTypes "OldFaithFul" "/Users/dom/Dropbox/Tidy/NumMethHaskell/variational/Old_Faithful_eruptions_1000.csv"
> 
> loadOldFaithful :: IO (Frame OldFaithFul)
> loadOldFaithful = inCoreAoS (readTable "/Users/dom/Dropbox/Tidy/NumMethHaskell/variational/Old_Faithful_eruptions_1000.csv")

Declare some columns and helper functions to manipulate the times.

> declareColumn "eruptionTimeNew"  ''UTCTime
> declareColumn "eruptionTimeOld"  ''UTCTime
> declareColumn "eruptionTimeDiff" ''NominalDiffTime
> declareColumn "eruptionTimeRat"  ''Rational
> declareColumn "eruptionTimeDoub" ''Double
> declareColumn "parsedDuration"   ''Double
> 
> epochToUTC :: Integral a => a -> UTCTime
> epochToUTC = posixSecondsToUTCTime . fromIntegral
> 
> getEruptionTimeNew :: OldFaithFul -> Record '[EruptionTimeNew]
> getEruptionTimeNew x = pure (Col ((epochToUTC (x ^. eruptionTimeEpoch)))) :& Nil
> 
> getEruptionTimeOld :: OldFaithFul -> Record '[EruptionTimeOld]
> getEruptionTimeOld x = pure (Col ((epochToUTC (x ^. eruptionTimeEpoch)))) :& Nil
> 
> getEruptionTimeDiff :: Record '[EruptionTimeOld, EruptionTimeNew] -> Record '[EruptionTimeDoub]
> getEruptionTimeDiff x = pure (Col ((/60.0) $ fromRational $ toRational $ diffUTCTime (x ^. eruptionTimeNew) (x ^. eruptionTimeOld))) :& Nil
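
As a quick sanity check of the epoch conversion:

ghci> epochToUTC (0 :: Int)
  1970-01-01 00:00:00 UTC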

The duration is expressed as a string, mostly in forms such as “3.5m” or “3.5min” but occasionally in other formats. Using NaNs to represent data we can’t parse is poor practice, but Frames seems to object to types such as Maybe Double.

> getDuration :: OldFaithFul -> Record '["duration" :-> Text]
> getDuration = (rcast @'[Duration])
> 
> getParsedDuration :: Record '[Duration, EruptionTimeDoub] -> Record '[ParsedDuration]
> getParsedDuration x = pure (Col ((f $ (parseOnly parseDuration) (x ^. duration)))) :& Nil
>   where
>     f (Left _)  = 0.0 / 0.0
>     f (Right y) = y
> 
> parseDuration :: Parser Double
> parseDuration =
>   do d <- double
>      _ <- char 'm'
>      return d
>   <|>
>   do d <- double
>      _ <- string "min"
>      return d
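
A quick check of the parser in ghci. Note that parseOnly does not require all of its input to be consumed, so the first alternative already accepts the “min” form by matching its “m” prefix (indeed, the second alternative can never fire, since any input matching “min” also matches the first alternative):

ghci> parseOnly parseDuration "3.5m"
  Right 3.5
ghci> parseOnly parseDuration "3.5min"
  Right 3.5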

To get the times between eruptions we need to zip the frame with its tail so we define this helper function.

> dropFrame :: Int -> Frame r -> Frame r
> dropFrame n s = Frame (frameLength s - n) (\i -> frameRow s (i + n))
> main :: IO ()
> main = do
>   x <- loadOldFaithful

Get the current eruption times and the previous eruption times and put them in a frame.

>   let tn = dropFrame 1 $ fmap (\b -> (getEruptionTimeNew b)) x
>   let tp = fmap (\b -> (getEruptionTimeOld b)) x
>   let tpns = zipFrames tp tn

Get the durations of the eruptions and exclude any gaps when no observations were taken; arbitrarily, we keep only inter-eruption gaps of at most 150 minutes.

>   let ds = fmap getDuration x
>   let tds = zipFrames ds (fmap getEruptionTimeDiff tpns)
>   let b = filterFrame (\u -> (u ^. eruptionTimeDoub) <= 150) tds

Parse some of the durations and put the durations and the inter-eruption gap into a single frame.

>   let c = filterFrame (\u -> not $ isNaN $ u ^. parsedDuration) $
>           fmap getParsedDuration b
>   let d = zipFrames b c
>   let g = fmap (\u -> (u ^. parsedDuration, u ^. eruptionTimeDoub)) d

Finally we can plot yet another data set of observations of Old Faithful.

>   renderRasterific "diagrams/TestH1000.png"
>                    (D.dims2D 500.0 500.0)
>                    (renderAxis $ kSaxis $ toList g)
>   R.withEmbeddedR R.defaultConfig $ do
>     plotOF
>     plotOF'
>   putStrLn "Finished"

To me this doesn’t look like either of the other two datasets. Perhaps the R faithful data set should be validated, possibly retired and maybe replaced by something with firmer foundations?

Nix

I use nix on macOS for package management, both for Haskell and R – no more install.packages problems. Here’s my shell.nix file, also available here.

{ nixpkgs ? import <nixpkgs> {}, compiler ? "ghc822", doBenchmark ? false }:

let

  inherit (nixpkgs) pkgs;

f = { mkDerivation, array, base, bytestring, cassava, containers
    , datasets, diagrams-lib, diagrams-rasterific, foldl, Frames, ghc-prim, hmatrix, hmatrix-gsl
    , inline-r, lens, mtl, pipes, plots, random-fu, R, random-source
    , stdenv, template-haskell, text, typelits-witnesses, vector, vinyl }:
mkDerivation {
  pname = "variational";
  version = "0.1.0.0";
  src = ./.;
  isLibrary = false;
  isExecutable = true;
  executableHaskellDepends = [
    array
    base
    bytestring
    cassava
    containers
    datasets
    diagrams-lib
    diagrams-rasterific
    foldl
    Frames
    ghc-prim
    hmatrix
    inline-r
    lens
    mtl
    pipes
    plots
    random-fu
    random-source
    template-haskell
    text
    typelits-witnesses
    vector
    vinyl
  ];
  executableSystemDepends = [
    R
    pkgs.rPackages.anytime
    pkgs.rPackages.ggplot2
    pkgs.rPackages.maptools
    pkgs.rPackages.reshape2
    pkgs.rPackages.rgeos
    pkgs.rPackages.rgdal
    pkgs.rPackages.rstan ];
  license = stdenv.lib.licenses.bsd3;
};

haskellPackages = if compiler == "default"
  then pkgs.haskellPackages
  else pkgs.haskell.packages.${compiler};

variant = if doBenchmark then pkgs.haskell.lib.doBenchmark else pkgs.lib.id;

drv = variant (haskellPackages.callPackage f {});

in

  if pkgs.lib.inNixShell then drv.env else drv

References

Azzalini, A., and A. W. Bowman. 1990. “A Look at Some Data on the Old Faithful Geyser.” Applied Statistics 39 (3): 357–65.

Bishop, C.M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. https://books.google.co.uk/books?id=kTNoQgAACAAJ.

Härdle, W. 2012. Smoothing Techniques: With Implementation in S. Springer Series in Statistics. Springer New York. https://books.google.co.uk/books?id=sijUBwAAQBAJ.

Modelling an Ecosystem via Hamiltonian Monte Carlo

Introduction

Recall from the previous post that the Hare growth parameter undergoes Brownian motion so that the further into the future we go, the less certain we are about it. In order to ensure that this parameter remains positive, let’s model the log of it to be Brownian motion.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2 \\  \mathrm{d} \rho_1 & = & \rho_1 \sigma_{\rho_1} \mathrm{d}W_t  \end{aligned}

where the final equation is a stochastic differential equation with W_t being a Wiener process.

By Itô we have

\displaystyle   \mathrm{d} (\log{\rho_1}) = - \frac{\sigma_{\rho_1}^2}{2} \mathrm{d} t + \sigma_{\rho_1} \mathrm{d}W_t
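
To spell out the Itô step: applying Itô’s lemma to f(\rho_1) = \log{\rho_1}, with \mathrm{d}\rho_1 = \rho_1 \sigma_{\rho_1} \mathrm{d}W_t, this follows from

\displaystyle   \mathrm{d} (\log{\rho_1}) = \frac{\mathrm{d}\rho_1}{\rho_1} - \frac{(\mathrm{d}\rho_1)^2}{2\rho_1^2} = \sigma_{\rho_1} \mathrm{d}W_t - \frac{\sigma_{\rho_1}^2}{2} \mathrm{d}t

using (\mathrm{d}W_t)^2 = \mathrm{d}t.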

Again, we see that the populations become noisier the further into the future we go.

Inference

Now let us infer the growth rate using Hamiltonian Monte Carlo and the domain specific probabilistic language Stan (Stan Development Team (2015b), Stan Development Team (2015a), Hoffman and Gelman (2014), Carpenter (2015)). Here’s the model expressed in Stan.

functions {
  real f1 (real a, real k1, real b, real p, real z) {
    real q;

    q = a * p * (1 - p / k1) - b * p * z;
    return q;
  }

  real f2 (real d, real k2, real c, real p, real z) {
    real q;

    q = -d * z * (1 + z / k2) + c * p * z;
    return q;
  }
}

data {
  int<lower=1> T;   // Number of observations
  real y[T];        // Observed hares
  real k1;          // Hare carrying capacity
  real b;           // Hare death rate per lynx
  real d;           // Lynx death rate
  real k2;          // Lynx carrying capacity
  real c;           // Lynx birth rate per hare
  real deltaT;      // Time step
}

parameters {
  real<lower=0> mu;
  real<lower=0> sigma;
  real<lower=0> p0;
  real<lower=0> z0;
  real<lower=0> rho0;
  real w[T];
}

transformed parameters {
  real<lower=0> p[T];
  real<lower=0> z[T];
  real<lower=0> rho[T];

  p[1] = p0;
  z[1] = z0;
  rho[1] = rho0;

  for (t in 1:(T-1)){
    p[t+1] = p[t] + deltaT * f1 (exp(rho[t]), k1, b, p[t], z[t]);
    z[t+1] = z[t] + deltaT * f2 (d, k2, c, p[t], z[t]);

    rho[t+1] = rho[t] * exp(sigma * sqrt(deltaT) * w[t] - 0.5 * sigma * sigma * deltaT);
  }
}

model {
  mu    ~ uniform(0.0,1.0);
  sigma ~ uniform(0.0, 0.5);
  p0    ~ lognormal(log(100.0), 0.2);
  z0    ~ lognormal(log(50.0), 0.1);
  rho0  ~ normal(log(mu), sigma);
  w     ~ normal(0.0,1.0);

  for (t in 1:T) {
    y[t] ~ lognormal(log(p[t]),0.1);
  }
}

Let’s look at the posteriors of the hyper-parameters for the Hare growth parameter.

Again, the estimate for \mu is pretty decent. For our generated data, \sigma = 0 and, given that our observations are quite noisy, maybe the estimate for this is not too bad either.

Appendix: The R Driving Code

All code including the R below can be downloaded from github.

install.packages("devtools")
library(devtools)
install_github("libbi/RBi",ref="master")
install_github("sbfnk/RBi.helpers",ref="master")

rm(list = ls(all.names=TRUE))
unlink(".RData")

library('RBi')
try(detach(package:RBi, unload = TRUE), silent = TRUE)
library(RBi, quietly = TRUE)

library('RBi.helpers')

library('ggplot2', quietly = TRUE)
library('gridExtra', quietly = TRUE)

endTime <- 50

PP <- bi_model("PP.bi")
synthetic_dataset_PP <- bi_generate_dataset(endtime=endTime,
                                            model=PP,
                                            seed="42",
                                            verbose=TRUE,
                                            add_options = list(
                                                noutputs=500))

rdata_PP <- bi_read(synthetic_dataset_PP)

df <- data.frame(rdata_PP$P$nr,
                 rdata_PP$P$value,
                 rdata_PP$Z$value,
                 rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = Population, color = variable), size = 0.1) +
    geom_line(aes(y = rdata_PP$P$value, col = "Hare"), size = 0.1) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Lynx"), size = 0.1) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Observations"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Example Data") +
    xlab("Days") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/LVdata.png",width=4,height=3)

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

lvStanModel <- stan_model(file = "SHO.stan",verbose=TRUE)

lvFit <- sampling(lvStanModel,
                  seed=42,
                  data=list(T = length(rdata_PP$P_obs$value),
                            y = rdata_PP$P_obs$value,
                            k1 = 2.0e2,
                            b  = 2.0e-2,
                            d  = 4.0e-1,
                            k2 = 2.0e1,
                            c  = 4.0e-3,
                            deltaT = rdata_PP$P_obs$time[2] - rdata_PP$P_obs$time[1]
                            ),
                   chains=1)

samples <- extract(lvFit)

gs1 <- qplot(x = samples$mu, y = ..density.., geom = "histogram") + xlab(expression(mu))
gs2 <- qplot(x = samples$sigma, y = ..density.., geom = "histogram") + xlab(expression(sigma))
gs3 <- grid.arrange(gs1, gs2)
ggsave(plot=gs3,filename="diagrams/LvPosteriorStan.png",width=4,height=3)

synthetic_dataset_PP1 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 100, Z=50),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP1 <- bi_read(synthetic_dataset_PP1)

synthetic_dataset_PP2 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 150, Z=25),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP2 <- bi_read(synthetic_dataset_PP2)

df1 <- data.frame(hare = rdata_PP$P$value,
                  lynx = rdata_PP$Z$value,
                  hare1 = rdata_PP1$P$value,
                  lynx1 = rdata_PP1$Z$value,
                  hare2 = rdata_PP2$P$value,
                  lynx2 = rdata_PP2$Z$value)

ggplot(df1) +
    geom_path(aes(x=df1$hare,  y=df1$lynx, col = "0"), size = 0.1) +
    geom_path(aes(x=df1$hare1, y=df1$lynx1, col = "1"), size = 0.1) +
    geom_path(aes(x=df1$hare2, y=df1$lynx2, col = "2"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Phase Space") +
    xlab("Hare") +
    ylab("Lynx") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/PPviaLibBi.png",width=4,height=3)

PPInfer <- bi_model("PPInfer.bi")

bi_object_PP <- libbi(client="sample", model=PPInfer, obs = synthetic_dataset_PP)

bi_object_PP$run(add_options = list(
                     "end-time" = endTime,
                     noutputs = endTime,
                     nsamples = 2000,
                     nparticles = 128,
                     seed=42,
                     nthreads = 1),
                 verbose = TRUE,
                 stdoutput_file_name = tempfile(pattern="pmmhoutput", fileext=".txt"))

bi_file_summary(bi_object_PP$result$output_file_name)

mu <- bi_read(bi_object_PP, "mu")$value
g1 <- qplot(x = mu[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(mu))
sigma <- bi_read(bi_object_PP, "sigma")$value
g2 <- qplot(x = sigma[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(sigma))
g3 <- grid.arrange(g1, g2)
ggsave(plot=g3,filename="diagrams/LvPosterior.png",width=4,height=3)


df2 <- data.frame(hareActs = rdata_PP$P$value,
                  hareObs  = rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = value, color = variable)) +
    geom_line(aes(y = rdata_PP$P$value, col = "Phyto")) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Zoo")) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Phyto Obs"))

ln_alpha <- bi_read(bi_object_PP, "ln_alpha")$value

P <- matrix(bi_read(bi_object_PP, "P")$value,nrow=51,byrow=TRUE)
Z <- matrix(bi_read(bi_object_PP, "Z")$value,nrow=51,byrow=TRUE)

data50 <- bi_generate_dataset(endtime=endTime,
                              model=PP,
                              seed="42",
                              verbose=TRUE,
                              add_options = list(
                                  noutputs=50))

rdata50 <- bi_read(data50)

df3 <- data.frame(days = c(1:51), hares = rowMeans(P), lynxes = rowMeans(Z),
                                  actHs = rdata50$P$value, actLs = rdata50$Z$value)


ggplot(df3) +
    geom_line(aes(x = days, y = hares, col = "Est Phyto")) +
    geom_line(aes(x = days, y = lynxes, col = "Est Zoo")) +
    geom_line(aes(x = days, y = actHs, col = "Act Phyto")) +
    geom_line(aes(x = days, y = actLs, col = "Act Zoo"))

Bibliography

Carpenter, Bob. 2015. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research.

Stan Development Team. 2015a. Stan Modeling Language User’s Guide and Reference Manual, Version 2.10.0. http://mc-stan.org/.

———. 2015b. “Stan: A C++ Library for Probability and Sampling, Version 2.10.0.” http://mc-stan.org/.

Ecology, Dynamical Systems and Inference via PMMH

Introduction

In the 1920s, Lotka (1909) and Volterra (1926) developed a model of a very simple predator-prey ecosystem.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1  - c_1 N_1 N_2 \label{eq2a} \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & c_2 N_1 N_2 - \rho_2 N_2 \label{eq2b}  \end{aligned}

Although simple, this model turns out to represent the populations of the Canadian lynx and the snowshoe hare quite well. Furthermore, the Hudson’s Bay Company kept records of how many pelts of each species were trapped for almost a century, giving a good proxy for the population of each species.

We can capture the fact that we do not have a complete model by describing our state of ignorance about the parameters. In order to keep this as simple as possible, let us assume that the log parameters undergo Brownian motion. That is, we know the parameters will jiggle around, and the further into the future we look, the less certain we are about what values they will have taken. By making the log parameters undergo Brownian motion, we can also capture our modelling assumption that birth, death and predation rates are always positive. A similar approach is taken in Dureau, Kalogeropoulos, and Baguelin (2013), where the (log) parameters of an epidemiological model are taken to be Ornstein–Uhlenbeck processes (which is biologically more plausible, although it adds to the complexity of the model, something we wish to avoid in an example such as this).

Andrieu, Doucet, and Holenstein (2010) propose a method to estimate the parameters of such models (Particle Marginal Metropolis Hastings aka PMMH) and the domain specific probabilistic language LibBi (Murray (n.d.)) can be used to apply this (and other inference methods).

For the sake of simplicity, in this blog post, we only model one parameter as being unknown and undergoing Brownian motion. A future blog post will consider more sophisticated scenarios.

A Dynamical System Aside

The above dynamical system is structurally unstable (more on this in a future post), a possible indication that it should not be considered as a good model of predator–prey interaction. Let us modify this to include carrying capacities for the populations of both species.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2  \end{aligned}

Data Generation with LibBi

Let’s generate some data using LibBi.

// Generate data assuming a fixed growth rate for hares rather than
// e.g. a growth rate that undergoes Brownian motion.

model PP {
  const h         = 0.1;    // time step
  const delta_abs = 1.0e-3; // absolute error tolerance
  const delta_rel = 1.0e-6; // relative error tolerance

  const a  = 5.0e-1 // Hare growth rate
  const k1 = 2.0e2  // Hare carrying capacity
  const b  = 2.0e-2 // Hare death rate per lynx
  const d  = 4.0e-1 // Lynx death rate
  const k2 = 2.0e1  // Lynx carrying capacity
  const c  = 4.0e-3 // Lynx birth rate per hare

  state P, Z       // Hares and lynxes
  state ln_alpha   // Hare growth rate - we express it in log form for
                   // consistency with the inference model
  obs P_obs        // Observations of hares

  sub initial {
    P ~ log_normal(log(100.0), 0.2)
    Z ~ log_normal(log(50.0), 0.1)
  }

  sub transition(delta = h) {
    ode(h = h, atoler = delta_abs, rtoler = delta_rel, alg = 'RK4(3)') {
      dP/dt =  a * P * (1 - P / k1) - b * P * Z
      dZ/dt = -d * Z * (1 + Z / k2) + c * P * Z
    }
  }

  sub observation {
    P_obs ~ log_normal(log(P), 0.1)
  }
}

We can look at phase space starting with different populations and see they all converge to the same fixed point.

Data Generation with Haskell

Since at some point in the future, I plan to produce Haskell versions of the methods given in Andrieu, Doucet, and Holenstein (2010), let’s generate the data using Haskell.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> module LotkaVolterra (
>     solLv
>   , solPp
>   , h0
>   , l0
>   , baz
>   , logBM
>   , eulerEx
>   )where
> import Numeric.GSL.ODE
> import Numeric.LinearAlgebra
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import Control.Monad.State

Here’s the unstable model.

> lvOde :: Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          [Double] ->
>          [Double]
> lvOde rho1 c1 rho2 c2 _t [h, l] =
>   [
>     rho1 * h - c1 * h * l
>   , c2 * h * l - rho2 * l
>   ]
> lvOde _rho1 _c1 _rho2 _c2 _t vars =
>   error $ "lvOde called with: " ++ show (length vars) ++ " variable"
> rho1, c1, rho2, c2 :: Double
> rho1 = 0.5
> c1 = 0.02
> rho2 = 0.4
> c2 = 0.004
> deltaT :: Double
> deltaT = 0.1
> solLv :: Matrix Double
> solLv = odeSolve (lvOde rho1 c1 rho2 c2)
>                  [50.0, 50.0]
>                  (fromList [0.0, deltaT .. 50])

And here’s the stable model.

> ppOde :: Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          Double ->
>          [Double] ->
>          [Double]
> ppOde a k1 b d k2 c _t [p, z] =
>   [
>     a * p * (1 - p / k1) - b * p * z
>   , -d * z * (1 + z / k2) + c * p * z
>   ]
> ppOde _a _k1 _b _d _k2 _c _t vars =
>   error $ "ppOde called with: " ++ show (length vars) ++ " variable"
> a, k1, b, d, k2, c :: Double
> a = 0.5
> k1 = 200.0
> b = 0.02
> d = 0.4
> k2 = 50.0
> c = 0.004
> solPp :: Double -> Double -> Matrix Double
> solPp x y = odeSolve (ppOde a k1 b d k2 c)
>                  [x, y]
>                  (fromList [0.0, deltaT .. 50])
> gamma, alpha, beta :: Double
> gamma = d / a
> alpha = a / (c * k1)
> beta  = d / (a * k2)
> fp :: (Double, Double)
> fp = ((gamma + beta) / (1 + alpha * beta), (1 - gamma * alpha) / (1 + alpha * beta))
> h0, l0 :: Double
> h0 = a * fst fp / c
> l0 = a * snd fp / b
> foo, bar :: Matrix R
> foo = matrix 2 [a / k1, b, c, negate d / k2]
> bar = matrix 1 [a, d]
> baz :: Maybe (Matrix R)
> baz = linearSolve foo bar

This gives a stable fixed point of

ghci> baz
  Just (2><1)
   [ 120.00000000000001
   ,               10.0 ]

Here’s an example of convergence to that fixed point in phase space.

The Stochastic Model

Let us now assume that the Hare growth parameter undergoes Brownian motion so that the further into the future we go, the less certain we are about it. In order to ensure that this parameter remains positive, let’s model the log of it to be Brownian motion.

\displaystyle   \begin{aligned}  \frac{\mathrm{d}N_1}{\mathrm{d}t} & = & \rho_1 N_1 \bigg(1 - \frac{N_1}{K_1}\bigg) - c_1 N_1 N_2 \\  \frac{\mathrm{d}N_2}{\mathrm{d}t} & = & -\rho_2 N_2 \bigg(1 + \frac{N_2}{K_2}\bigg) + c_2 N_1 N_2 \\  \mathrm{d} \rho_1 & = & \rho_1 \sigma_{\rho_1} \mathrm{d}W_t  \end{aligned}

where the final equation is a stochastic differential equation with W_t being a Wiener process.

By Itô we have

\displaystyle   \mathrm{d} (\log{\rho_1}) = - \frac{\sigma_{\rho_1}^2}{2} \mathrm{d} t + \sigma_{\rho_1} \mathrm{d}W_t

We can use this to generate paths for \rho_1.

\displaystyle   \rho_1(t + \Delta t) = \rho_1(t)\exp{\bigg(- \frac{\sigma_{\rho_1}^2}{2} \Delta t + \sigma_{\rho_1} \sqrt{\Delta t} Z\bigg)}

where Z \sim {\mathcal{N}}(0,1).

> oneStepLogBM :: MonadRandom m => Double -> Double -> Double -> m Double
> oneStepLogBM deltaT sigma rhoPrev = do
>   x <- sample $ rvar StdNormal
>   return $ rhoPrev * exp(sigma * (sqrt deltaT) * x - 0.5 * sigma * sigma * deltaT)
> iterateM :: Monad m => (a -> m a) -> m a -> Int -> m [a]
> iterateM f mx n = sequence . take n . iterate (>>= f) $ mx
> logBMM :: MonadRandom m => Double -> Double -> Int -> Int -> m [Double]
> logBMM initRho sigma n m =
>   iterateM (oneStepLogBM (recip $ fromIntegral n) sigma) (return initRho) (n * m)
> logBM :: Double -> Double -> Int -> Int -> Int -> [Double]
> logBM initRho sigma n m seed =
>   evalState (logBMM initRho sigma n m) (pureMT $ fromIntegral seed)
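
As an aside, the slightly cryptic iterateM can be checked in ghci using, say, the Maybe monad; note that the n values returned include the initial one:

ghci> iterateM (\x -> Just (x * 2)) (Just 1) 4
  Just [1,2,4,8]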

We can see the further we go into the future the less certain we are about the value of the parameter.

Using this we can simulate the whole dynamical system which is now a stochastic process.

> f1, f2 :: Double -> Double -> Double ->
>           Double -> Double ->
>           Double
> f1 a k1 b p z = a * p * (1 - p / k1) - b * p * z
> f2 d k2 c p z = -d * z * (1 + z / k2) + c * p * z
> oneStepEuler :: MonadRandom m =>
>                 Double ->
>                 Double ->
>                 Double -> Double ->
>                 Double -> Double -> Double ->
>                 (Double, Double, Double) ->
>                 m (Double, Double, Double)
> oneStepEuler deltaT sigma k1 b d k2 c (rho1Prev, pPrev, zPrev) = do
>     let pNew = pPrev + deltaT * f1 (exp rho1Prev) k1 b pPrev zPrev
>     let zNew = zPrev + deltaT * f2 d k2 c pPrev zPrev
>     rho1New <- oneStepLogBM deltaT sigma rho1Prev
>     return (rho1New, pNew, zNew)
> euler :: MonadRandom m =>
>          (Double, Double, Double) ->
>          Double ->
>          Double -> Double ->
>          Double -> Double -> Double ->
>          Int -> Int ->
>          m [(Double, Double, Double)]
> euler stateInit sigma k1 b d k2 c n m =
>   iterateM (oneStepEuler (recip $ fromIntegral n) sigma k1 b d k2 c)
>            (return stateInit)
>            (n * m)
> eulerEx :: (Double, Double, Double) ->
>            Double -> Int -> Int -> Int ->
>            [(Double, Double, Double)]
> eulerEx stateInit sigma n m seed =
>   evalState (euler stateInit sigma k1 b d k2 c n m) (pureMT $ fromIntegral seed)

We see that the populations become noisier the further into the future we go.

Notice that the second order effects of the system are now to some extent captured by the fact that the growth rate of Hares can drift. In our simulation, this is demonstrated by our increasing lack of knowledge the further we look into the future.

Inference

Now let us infer the growth rate using PMMH. Here’s the model expressed in LibBi.

// Infer growth rate for hares

model PP {
  const h         = 0.1;    // time step
  const delta_abs = 1.0e-3; // absolute error tolerance
  const delta_rel = 1.0e-6; // relative error tolerance

  const a  = 5.0e-1 // Hare growth rate - superfluous for inference
                    // but a reminder of what we should expect
  const k1 = 2.0e2  // Hare carrying capacity
  const b  = 2.0e-2 // Hare death rate per lynx
  const d  = 4.0e-1 // Lynx death rate
  const k2 = 2.0e1  // Lynx carrying capacity
  const c  = 4.0e-3 // Lynx birth rate per hare

  state P, Z       // Hares and lynxes
  state ln_alpha   // Hare growth rate - we express it in log form for
                   // consistency with the inference model
  obs P_obs        // Observations of hares
  param mu, sigma  // Mean and standard deviation of hare growth rate
  noise w          // Noise

  sub parameter {
    mu ~ uniform(0.0, 1.0)
    sigma ~ uniform(0.0, 0.5)
  }

  sub proposal_parameter {
     mu ~ truncated_gaussian(mu, 0.02, 0.0, 1.0);
     sigma ~ truncated_gaussian(sigma, 0.01, 0.0, 0.5);
   }

  sub initial {
    P ~ log_normal(log(100.0), 0.2)
    Z ~ log_normal(log(50.0), 0.1)
    ln_alpha ~ gaussian(log(mu), sigma)
  }

  sub transition(delta = h) {
    w ~ normal(0.0, sqrt(h));
    ode(h = h, atoler = delta_abs, rtoler = delta_rel, alg = 'RK4(3)') {
      dP/dt =  exp(ln_alpha) * P * (1 - P / k1) - b * P * Z
      dZ/dt = -d * Z * (1 + Z / k2) + c * P * Z
      dln_alpha/dt = -sigma * sigma / 2 - sigma * w / h
    }
  }

  sub observation {
    P_obs ~ log_normal(log(P), 0.1)
  }
}

Let’s look at the posteriors of the hyper-parameters for the Hare growth parameter.

The estimate for \mu is pretty decent. For our generated data, \sigma = 0 and, given that our observations are quite noisy, maybe the estimate for this is not too bad either.

Appendix: The R Driving Code

All code including the R below can be downloaded from github but make sure you use the straight-libbi branch and not master.

install.packages("devtools")
library(devtools)
install_github("sbfnk/RBi",ref="master")
install_github("sbfnk/RBi.helpers",ref="master")

rm(list = ls(all.names=TRUE))
unlink(".RData")

library('RBi')
try(detach(package:RBi, unload = TRUE), silent = TRUE)
library(RBi, quietly = TRUE)

library('RBi.helpers')

library('ggplot2', quietly = TRUE)
library('gridExtra', quietly = TRUE)

endTime <- 50

PP <- bi_model("PP.bi")
synthetic_dataset_PP <- bi_generate_dataset(endtime=endTime,
                                            model=PP,
                                            seed="42",
                                            verbose=TRUE,
                                            add_options = list(
                                                noutputs=500))

rdata_PP <- bi_read(synthetic_dataset_PP)

df <- data.frame(rdata_PP$P$nr,
                 rdata_PP$P$value,
                 rdata_PP$Z$value,
                 rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = Population, color = variable), size = 0.1) +
    geom_line(aes(y = rdata_PP$P$value, col = "Hare"), size = 0.1) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Lynx"), size = 0.1) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Observations"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Example Data") +
    xlab("Days") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/LVdata.png",width=4,height=3)

synthetic_dataset_PP1 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 100, Z=50),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP1 <- bi_read(synthetic_dataset_PP1)

synthetic_dataset_PP2 <- bi_generate_dataset(endtime=endTime,
                                             model=PP,
                                             init = list(P = 150, Z=25),
                                             seed="42",
                                             verbose=TRUE,
                                             add_options = list(
                                                 noutputs=500))

rdata_PP2 <- bi_read(synthetic_dataset_PP2)

df1 <- data.frame(hare = rdata_PP$P$value,
                  lynx = rdata_PP$Z$value,
                  hare1 = rdata_PP1$P$value,
                  lynx1 = rdata_PP1$Z$value,
                  hare2 = rdata_PP2$P$value,
                  lynx2 = rdata_PP2$Z$value)

ggplot(df1) +
    geom_path(aes(x=df1$hare,  y=df1$lynx, col = "0"), size = 0.1) +
    geom_path(aes(x=df1$hare1, y=df1$lynx1, col = "1"), size = 0.1) +
    geom_path(aes(x=df1$hare2, y=df1$lynx2, col = "2"), size = 0.1) +
    theme(legend.position="none") +
    ggtitle("Phase Space") +
    xlab("Hare") +
    ylab("Lynx") +
    theme(axis.text=element_text(size=4),
          axis.title=element_text(size=6,face="bold")) +
    theme(plot.title = element_text(size=10))
ggsave(filename="diagrams/PPviaLibBi.png",width=4,height=3)

PPInfer <- bi_model("PPInfer.bi")

bi_object_PP <- libbi(client="sample", model=PPInfer, obs = synthetic_dataset_PP)

bi_object_PP$run(add_options = list(
                     "end-time" = endTime,
                     noutputs = endTime,
                     nsamples = 4000,
                     nparticles = 128,
                     seed=42,
                     nthreads = 1),
                 ## verbose = TRUE,
                 stdoutput_file_name = tempfile(pattern="pmmhoutput", fileext=".txt"))

bi_file_summary(bi_object_PP$result$output_file_name)

mu <- bi_read(bi_object_PP, "mu")$value
g1 <- qplot(x = mu[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(mu))
sigma <- bi_read(bi_object_PP, "sigma")$value
g2 <- qplot(x = sigma[2001:4000], y = ..density.., geom = "histogram") + xlab(expression(sigma))
g3 <- grid.arrange(g1, g2)
ggsave(plot=g3,filename="diagrams/LvPosterior.png",width=4,height=3)


df2 <- data.frame(hareActs = rdata_PP$P$value,
                  hareObs  = rdata_PP$P_obs$value)

ggplot(df, aes(rdata_PP$P$nr, y = value, color = variable)) +
    geom_line(aes(y = rdata_PP$P$value, col = "Phyto")) +
    geom_line(aes(y = rdata_PP$Z$value, col = "Zoo")) +
    geom_point(aes(y = rdata_PP$P_obs$value, col = "Phyto Obs"))

ln_alpha <- bi_read(bi_object_PP, "ln_alpha")$value

P <- matrix(bi_read(bi_object_PP, "P")$value,nrow=51,byrow=TRUE)
Z <- matrix(bi_read(bi_object_PP, "Z")$value,nrow=51,byrow=TRUE)

data50 <- bi_generate_dataset(endtime=endTime,
                              model=PP,
                              seed="42",
                              verbose=TRUE,
                              add_options = list(
                                  noutputs=50))

rdata50 <- bi_read(data50)

df3 <- data.frame(days = c(1:51), hares = rowMeans(P), lynxes = rowMeans(Z),
                                  actHs = rdata50$P$value, actLs = rdata50$Z$value)


ggplot(df3) +
    geom_line(aes(x = days, y = hares, col = "Est Phyto")) +
    geom_line(aes(x = days, y = lynxes, col = "Est Zoo")) +
    geom_line(aes(x = days, y = actHs, col = "Act Phyto")) +
    geom_line(aes(x = days, y = actLs, col = "Act Zoo"))

Bibliography

Andrieu, Christophe, Arnaud Doucet, and Roman Holenstein. 2010. “Particle Markov chain Monte Carlo methods.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 72 (3): 269–342. doi:10.1111/j.1467-9868.2009.00736.x.

Dureau, Joseph, Konstantinos Kalogeropoulos, and Marc Baguelin. 2013. “Capturing the time-varying drivers of an epidemic using stochastic dynamical systems.” Biostatistics (Oxford, England) 14 (3): 541–55. doi:10.1093/biostatistics/kxs052.

Lotka, Alfred J. 1909. “Contribution to the Theory of Periodic Reactions.” The Journal of Physical Chemistry 14 (3): 271–74. doi:10.1021/j150111a004.

Murray, Lawrence M. n.d. “Bayesian State-Space Modelling on High-Performance Hardware Using LibBi.”

Volterra, Vito. 1926. “Variazioni e fluttuazioni del numero d’individui in specie animali conviventi.” Memorie Della R. Accademia Dei Lincei 6 (2): 31–113. http://www.liberliber.it/biblioteca/v/volterra/variazioni_e_fluttuazioni/pdf/volterra_variazioni_e_fluttuazioni.pdf.

Gibbs Sampling in R, Haskell, Jags and Stan

Introduction

It’s possible to do Gibbs sampling in most languages, and since I am doing some work in R and some work in Haskell, I thought I’d present a simple example in both languages: estimating the mean of a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or Stan to do the actual sampling, and to do pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and Stan as well as Haskell.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module Gibbs (
>     main
>   , m
>   , Moments(..)
>   ) where
> 
> import qualified Data.Vector.Unboxed as V
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import qualified Control.Foldl as L
> 
> import Diagrams.Backend.Cairo.CmdLine
> 
> import LinRegAux
> 
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )

The length of our chain and the burn-in.

> nrep, nb :: Int
> nb   = 5000
> nrep = 105000

Data generated from {\cal{N}}(10.0, 5.0).

> xs :: [Double]
> xs = [
>     11.0765808082301
>   , 10.918739177542
>   , 15.4302462747137
>   , 10.1435649220266
>   , 15.2112705014697
>   , 10.441327659703
>   , 2.95784054883142
>   , 10.2761068139607
>   , 9.64347295100318
>   , 11.8043359297675
>   , 10.9419989262713
>   , 7.21905367667346
>   , 10.4339807638017
>   , 6.79485294803006
>   , 11.817248658832
>   , 6.6126710570584
>   , 12.6640920214508
>   , 8.36604701073303
>   , 12.6048485320333
>   , 8.43143879537592
>   ]
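
For reference, a sample like this can be drawn with the random-fu machinery imported above. A minimal sketch (the seed here is arbitrary and will not reproduce the exact numbers above; Normal takes the mean and the standard deviation):

evalState (replicateM 20 (sample (Normal 10.0 5.0))) (pureMT 2) :: [Double]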

A Bit of Theory

Gibbs Sampling

For a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions.

Referring back to the explanation of the Metropolis algorithm, let us describe the state by its parameters i \triangleq \boldsymbol{\theta}^{(i)} \triangleq (\theta^{(i)}_1,\ldots, \theta^{(i)}_n) and the conditional posteriors by \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big) where {\boldsymbol{\theta}}^{(i)}_{-k} = \big(\theta_1^{(i)},\ldots,\theta_{k-1}^{(i)},\theta_{k+1}^{(i)},\ldots,\theta_n^{(i)}\big); then

\displaystyle   \begin{aligned}  \frac{\pi\big(\boldsymbol{\theta}^{(j)}\big)q\big(\boldsymbol{\theta}^{(j)}, \boldsymbol{\theta}^{(i)}\big)}  {\pi(\boldsymbol{\theta}^{(i)})q(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(j)})}  &=  \frac{  \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  }  {  \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)  } \\  &=  \frac{  \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  }  {  \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  } \\  &= 1  \end{aligned}

where we have used the rules of conditional probability and the fact that {\boldsymbol{\theta}}_{-k}^{(i)} = {\boldsymbol{\theta}}_{-k}^{(j)}.

Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters.

Normal Distribution with Unknown Mean and Variance

It is fairly standard to use an improper prior

\displaystyle   \begin{aligned}  \pi(\mu, \tau) \propto \frac{1}{\tau} & & -\infty < \mu < \infty\, \textrm{and}\, 0 < \tau < \infty  \end{aligned}

The likelihood is

\displaystyle   p(\boldsymbol{x}\,|\,\mu, \sigma) = \prod_{i=1}^n \bigg(\frac{1}{\sigma\sqrt{2\pi}}\bigg)\exp{\bigg( -\frac{(x_i - \mu)^2}{2\sigma^2}\bigg)}

re-writing in terms of precision

\displaystyle   p(\boldsymbol{x}\,|\,\mu, \tau) \propto \prod_{i=1}^n \sqrt{\tau}\exp{\bigg( -\frac{\tau}{2}{(x_i - \mu)^2}\bigg)} = \tau^{n/2}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}

Multiplying the likelihood by the improper prior, the posterior is

\displaystyle   p(\mu, \tau \,|\, \boldsymbol{x}) \propto \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}

We can re-write the sum in terms of the sample mean \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i and variance s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 using

\displaystyle   \begin{aligned}  \sum_{i=1}^n (x_i - \mu)^2 &= \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 \\  &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2\sum_{i=1}^n (x_i - \bar{x})(\bar{x} - \mu) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\  &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x}) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\  &= (n - 1)s^2 + n(\bar{x} - \mu)^2  \end{aligned}

Thus the conditional posterior for \mu is

\displaystyle   \begin{aligned}  p(\mu \,|\, \tau, \boldsymbol{x}) &\propto \exp{\bigg( -\frac{\tau}{2}\bigg((n - 1)s^2 + n{(\mu - \bar{x})^2}\bigg)\bigg)} \\  &\propto \exp{\bigg( -\frac{n\tau}{2}{(\mu - \bar{x})^2}\bigg)} \\  \end{aligned}

which we recognise as a normal distribution with a mean of \bar{x} and a variance of (n\tau)^{-1}, i.e. a standard deviation of (n\tau)^{-1/2}.

The conditional posterior for \tau is

\displaystyle   \begin{aligned}  p(\tau \,|\, \mu, \boldsymbol{x}) &\propto \tau^{n/2 -1}\exp\bigg(-\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)  \end{aligned}

which we recognise as a gamma distribution with a shape of n/2 and a rate of \frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}.

In this particular case, we can calculate the marginal posterior of \mu analytically. Writing z = \frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2} we have

\displaystyle   \begin{aligned}  p(\mu \,|\, \boldsymbol{x}) &= \int_0^\infty p(\mu, \tau \,|\, \boldsymbol{x}) \,\textrm{d}\tau \\  &\propto \int_0^\infty \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)} \,\textrm{d}\tau \\  &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \int_0^\infty z^{n/2 - 1}\exp{(-z)}\,\textrm{d}z \\  &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \\  \end{aligned}

Finally we can calculate

\displaystyle   \begin{aligned}  p(\mu \,|\, \boldsymbol{x}) &\propto \bigg( (n - 1)s^2 + n(\bar{x} - \mu)^2 \bigg)^{-n/2} \\  &\propto \bigg( 1 + \frac{n(\mu - \bar{x})^2}{(n - 1)s^2} \bigg)^{-n/2} \\  \end{aligned}

This is the non-standardized Student’s t-distribution t_{n-1}(\bar{x}, s^2/n).

Alternatively the marginal posterior of \mu is

\displaystyle   \frac{\mu - \bar{x}}{s/\sqrt{n}}\bigg|\, x \sim t_{n-1}

where t_{n-1} is the standard t distribution with n - 1 degrees of freedom.

The Model in Haskell

Following up on a comment from a previous blog post, let us try using the foldl package to calculate the length, the sum and the sum of squares in a single traversal of the list. This is an improvement on creating your own strict record and using foldl’, although it may not be suitable for every statistic, e.g. calculating the skewness and kurtosis incrementally; see below.

> x2Sum, xSum, n :: Double
> (x2Sum, xSum, n) = L.fold stats xs
>   where
>     stats = (,,) <$>
>             (L.premap (\x -> x * x) L.sum) <*>
>             L.sum <*>
>             L.genericLength

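The data xs is defined elsewhere in the post. Purely as an illustration, here is a minimal sketch of a definition consistent with the surrounding code: a sample of size 20 (matching the \nu = 19 degrees of freedom in the kurtosis check below); the mean, standard deviation and seed are made up.

-- A hypothetical data set, for illustration only: 20 draws from a
-- normal distribution with assumed mean 10.0 and standard deviation 2.0.
xs :: [Double]
xs = evalState (replicateM 20 (sample (Normal 10.0 2.0))) (pureMT 2)
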
And re-writing the sample variance s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 using

\displaystyle   \begin{aligned}  \sum_{i=1}^n (x_i - \bar{x})^2 &= \sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2) \\  &= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + \sum_{i=1}^n \bar{x}^2 \\  &= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\  &= \sum_{i=1}^n x_i^2 - n\bar{x}^2 \\  \end{aligned}

we can then calculate the sample mean and variance using the sums we have just calculated.

> xBar, varX :: Double
> xBar = xSum / n
> varX = n * (m2Xs - xBar * xBar) / (n - 1)
>   where m2Xs = x2Sum / n

In random-fu, the Gamma distribution is specified by the rate parameter, \beta.

> beta, initTau :: Double
> beta = 0.5 * n * varX
> initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)

Our sampler takes an old value of \tau and creates new values of \mu and \tau.

> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double))
> gibbsSampler oldTau = do
>   newMu <- sample (Normal xBar (recip (sqrt (n * oldTau))))
>   let shape = 0.5 * n
>       scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar)
>   newTau <- sample (Gamma shape (recip scale))
>   return $ Just ((newMu, newTau), newTau)

From which we can create an infinite stream of samples.

> gibbsSamples :: [(Double, Double)]
> gibbsSamples = evalState (ML.unfoldrM gibbsSampler initTau) (pureMT 1)

As our chains might be very long, we calculate the mean, variance, skewness and kurtosis using an incremental method.

> data Moments = Moments { mN :: !Double
>                        , m1 :: !Double
>                        , m2 :: !Double
>                        , m3 :: !Double
>                        , m4 :: !Double
>                        }
>   deriving Show
> moments :: [Double] -> Moments
> moments xs = foldl' f (Moments 0.0 0.0 0.0 0.0 0.0) xs
>   where
>     f :: Moments -> Double -> Moments
>     f m x = Moments n' m1' m2' m3' m4'
>       where
>         n = mN m
>         n'  = n + 1
>         delta = x - (m1 m)
>         delta_n = delta / n'
>         delta_n2 = delta_n * delta_n
>         term1 = delta * delta_n * n
>         m1' = m1 m + delta_n
>         m4' = m4 m +
>               term1 * delta_n2 * (n'*n' - 3*n' + 3) +
>               6 * delta_n2 * m2 m - 4 * delta_n * m3 m
>         m3' = m3 m + term1 * delta_n * (n' - 2) - 3 * delta_n * m2 m
>         m2' = m2 m + term1

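The ghci session below computes the skewness and kurtosis from these moments inline; for convenience we could wrap the same formulas in helpers. A minimal sketch (the names skewness and kurtosis are mine, not part of the original code):

-- Sample skewness and (non-excess) kurtosis computed from the
-- accumulated moments; the formulas match the ghci session below.
skewness, kurtosis :: Moments -> Double
skewness mom = sqrt (mN mom) * m3 mom / (m2 mom)**1.5
kurtosis mom = mN mom * m4 mom / (m2 mom)**2
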
In order to examine the posterior, we create a histogram.

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = xBar - 2.0 * sqrt varX
>     upper = xBar + 2.0 * sqrt varX

And fill it with the specified number of samples preceded by a burn-in.

> hist :: Histogram V.Vector BinD Double
> hist = fillBuilder hb (take (nrep - nb) $ drop nb $ map fst gibbsSamples)
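
The chain length nrep and the burn-in nb are defined elsewhere in the post. Purely as a hypothetical illustration, values of roughly this shape would do:

-- Assumed values, for illustration only: the total number of samples
-- and the number discarded as burn-in.
nrep, nb :: Int
nrep = 100000
nb   = 10000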

Now we can plot this.

And calculate the skewness and kurtosis.

> m :: Moments
> m = moments (take (nrep - nb) $ drop nb $ map fst gibbsSamples)
ghci> import Gibbs
ghci> putStrLn $ show $ (sqrt (mN m)) * (m3 m) / (m2 m)**1.5
  8.733959917065126e-4

ghci> putStrLn $ show $ (mN m) * (m4 m) / (m2 m)**2
  3.451374739494607

We expect a skewness of 0 and a kurtosis of 3 + 6/(\nu - 4) = 3.4 for \nu = 19. Not too bad.

The Model in JAGS

JAGS is a mature, declarative, domain specific language for building Bayesian statistical models using Gibbs sampling.

Here is our model as expressed in JAGS. It is somewhat terse; note that dnorm in JAGS is parameterised by precision rather than standard deviation, hence the line converting sigma to tau.

model {
	for (i in 1:N) {
		x[i] ~ dnorm(mu, tau)
	}
	mu ~ dnorm(0, 1.0E-6)
	tau <- pow(sigma, -2)
	sigma ~ dunif(0, 1000)
}

To run it and examine its results, we wrap it up in some R.

## Import the library that allows R to inter-work with jags.
library(rjags)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

jags <- jags.model('example1.bug',
                   data = list('x' = fn[,1], 'N' = 20),
                   n.chains = 4,
                   n.adapt = 100)

## Burnin for 10000 samples
update(jags, 10000);

mcmc_samples <- coda.samples(jags, variable.names=c("mu", "sigma"), n.iter=20000)

png(file="diagrams/jags.png",width=400,height=350)
plot(mcmc_samples)
dev.off()

And now we can look at the posterior for \mu.

The Model in STAN

STAN is a domain specific language for building Bayesian statistical models, similar to JAGS but newer. Unlike JAGS, it allows variables to be re-assigned and so cannot really be described as declarative.

Here is our model as expressed in STAN. Again, somewhat terse.

data {
  int<lower=0> N;
  real x[N];
}

parameters {
  real mu;
  real<lower=0,upper=1000> sigma;
}
model{
  x     ~ normal(mu, sigma);
  mu    ~ normal(0, 1000);
}

Just as with JAGS, to run it and examine its results, we wrap it up in some R.

library(rstan)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

## Running the model
fit1 <- stan(file = 'Stan.stan',
             data = list('x' = fn[,1], 'N' = 20),
             pars=c("mu", "sigma"),
             chains=3,
             iter=30000,
             warmup=10000)

png(file="diagrams/stan.png",width=400,height=350)
plot(fit1)
dev.off()

Again we can look at the posterior, although we only seem to get medians and 80% intervals.

PostAmble

Write the histogram produced by the Haskell code to a file.

> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
>   mainRender ( DiagramOpts (Just 900) (Just 700) fn
>              , DiagramLoopOpts False Nothing 0
>              )
> main :: IO ()
> main = do
>   displayHeader "diagrams/DataScienceHaskPost.png"
>     (barDiag
>      (zip (map fst $ asList hist) (map snd $ asList hist)))

The code can be downloaded from github.

Getting Financial Data in R

I have recently started providing consultancy to a hedge fund and as far as I can see, R looks like it has a good set of libraries for this domain. In my previous job I used an embedded domain specific language in Haskell (Frankau et al. 2009). I’d like to be able to use Haskell again but my impression is that publicly available libraries either do not exist or are not particularly mature although I would love to be proved wrong.

I’ve used knitr to produce this post rather than my usual BlogLiteratelyD.

For example, let us plot an index.

First we load the quantmod library

library(quantmod)

We can chart the S&P 500 for 2013.

GSPC <- getSymbols("^GSPC", src = "yahoo", auto.assign = FALSE)
dim(GSPC)
## [1] 1768    6
head(GSPC, 4)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2007-01-03      1418      1429     1408       1417   3.429e+09
## 2007-01-04      1417      1422     1408       1418   3.004e+09
## 2007-01-05      1418      1418     1406       1410   2.919e+09
## 2007-01-08      1409      1415     1404       1413   2.763e+09
##            GSPC.Adjusted
## 2007-01-03          1417
## 2007-01-04          1418
## 2007-01-05          1410
## 2007-01-08          1413
tail(GSPC, 4)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2014-01-06      1832      1837     1824       1827   3.295e+09
## 2014-01-07      1829      1840     1829       1838   3.512e+09
## 2014-01-08      1838      1840     1831       1837   3.652e+09
## 2014-01-09      1839      1843     1830       1838   3.581e+09
##            GSPC.Adjusted
## 2014-01-06          1827
## 2014-01-07          1838
## 2014-01-08          1837
## 2014-01-09          1838
chartSeries(GSPC, subset = "2013", theme = "white")
S&P 500 for 2013

We can also chart currencies e.g. the Rupee / US Dollar exchange rate.

INRUSD <- getSymbols("INR=X", src = "yahoo", auto.assign = FALSE)
dim(INRUSD)
## [1] 1805    6
head(INRUSD, 4)
##            INR=X.Open INR=X.High INR=X.Low INR=X.Close INR=X.Volume
## 2007-01-01      44.22      44.22     44.04       44.22            0
## 2007-01-02      44.21      44.22     44.08       44.12            0
## 2007-01-03      44.12      44.41     44.09       44.11            0
## 2007-01-04      44.12      44.48     44.10       44.10            0
##            INR=X.Adjusted
## 2007-01-01          44.22
## 2007-01-02          44.12
## 2007-01-03          44.11
## 2007-01-04          44.10
tail(INRUSD, 4)
##            INR=X.Open INR=X.High INR=X.Low INR=X.Close INR=X.Volume
## 2014-01-01      61.84      61.97     61.80       61.80            0
## 2014-01-02      61.84      62.41     61.74       61.84            0
## 2014-01-03      62.06      62.57     62.06       62.06            0
## 2014-01-06      62.23      62.45     61.94       62.23            0
##            INR=X.Adjusted
## 2014-01-01          61.80
## 2014-01-02          61.84
## 2014-01-03          62.06
## 2014-01-06          62.23
chartSeries(INRUSD, subset = "2013", theme = "white")
Rupee / US Dollar exchange rate for 2013

Bibliography

Frankau, Simon, Diomidis Spinellis, Nick Nassuphis, and Christoph Burgard. 2009. “Commercial Uses: Going Functional on Exotic Trades.” Journal of Functional Programming 19 (1): 27–45. doi:10.1017/S0956796808007016.

The Metropolis Algorithm

A Simple Example

In Section 7.1 of Doing Bayesian Data Analysis, John Kruschke gives a simple example of the Metropolis algorithm, in which we generate samples from a distribution without knowing the distribution itself. Of course the example is contrived as we really do know the distribution. In the particular example, {\cal P}(X = i) = i/k for i = 1…n, where n is some fixed natural number and k = n(n+1)/2 is the normalising constant.

Here’s the algorithm in Haskell. We use the random-fu package and the rvar package for random variables and the random package to supply random numbers.

{-# LANGUAGE ScopedTypeVariables, NoMonomorphismRestriction #-}

import Data.Random
import Data.RVar
import System.Random
import Control.Monad.State
import Data.List

We pick an explicit seed and set n = 7 (in Kruschke’s example this is the number of islands that a politician visits).

seed :: Int
seed = 2
numIslands :: Int
numIslands = 7

And we pick an arbitrary number of jumps to try in the Metropolis algorithm.

n :: Int
n = 11112

We generate proposed jumps which are either one step up or one step down.

proposedJumps :: Int -> [Int]
proposedJumps seed =
  map f $ fst $
  runState (replicateM n $ sampleRVar $ uniform False True) (mkStdGen seed)
    where f False = negate 1
          f True  = 1

And we generate samples from the uniform distribution on [0, 1] which will allow us to determine whether to accept or reject a proposed jump.

acceptOrRejects :: Int -> [Double]
acceptOrRejects seed =
  fst $ runState (replicateM n $ sampleRVar $ uniform 0 1) (mkStdGen seed)

We pretend that we only know a measure of how often we pick a given number, but not the normalising constant that would make this measure a probability measure.

p n | n >= 1 && n <= numIslands = n
    | otherwise                 = 0

We define a function which defines one step of the Metropolis algorithm.

f currentPosition (proposedJump, acceptOrReject) =
  if acceptOrReject < probAccept
    then
      currentPosition + proposedJump
    else
      currentPosition
  where
    probAccept = min 1 (pNew / pCur)
    -- Use the un-normalised measure p so that jumps outside the
    -- range of islands have measure 0 and are never accepted.
    pNew = fromIntegral $ p (currentPosition + proposedJump)
    pCur = fromIntegral $ p currentPosition

Let’s try this out, starting at position 3, with a somewhat arbitrary burn-in period.

runMC seed = map (/ total) numVisits
  where
    total     = sum numVisits
    numVisits = map (\j -> fromIntegral $ length $ filter (==j) $ xs)
                    [1 .. numIslands]
    xs        = drop (n `div` 10) $
                scanl f 3 (zip (proposedJumps seed) (acceptOrRejects seed))

We can then compute the root mean square error of the estimate for a given seed.

actual = map (/s) xs
           where
             xs = map fromIntegral [1 .. numIslands]
             s  = sum xs

rmsError seed = sqrt $ (/(fromIntegral numIslands)) $
                sum $ map (^2) $ zipWith (-) actual (runMC seed)
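
To see the estimates, we could print the estimated stationary distribution and its error; a sketch with a hypothetical main, not part of the original:

main :: IO ()
main = do
  -- Estimated long-run proportion of time spent on each island.
  print (runMC seed)
  -- Root mean square error against the known target distribution.
  print (rmsError seed)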

A Bernoulli Example

We can also code the Metropolis algorithm for the example in which samples are drawn from a Bernoulli distribution. First we define the likelihood of drawing a specific sequence of 0’s and 1’s from a Bernoulli distribution with parameter θ. We also define the example sample of data given in Doing Bayesian Data Analysis.

likelihood :: Int -> Int -> Double -> Double
likelihood z n theta = theta^z * (1 - theta)^(n - z)

myData = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

We define a function which defines one step of the Metropolis algorithm.

oneStep currPosition (propJump, acceptOrReject) =
  if acceptOrReject < probAccept
    then
      currPosition + propJump
    else
      currPosition
  where
    probAccept = min 1 (p (currPosition + propJump) / p currPosition)
    p x | x < 0 = 0
        | x > 1 = 0
        | otherwise = pAux myData x
    pAux :: [Double] -> Double -> Double
    pAux xs position = likelihood z n position
      where
        n = length xs
        z = length $ filter (== 1) xs

Finally we need some proposed jumps; following the example we generate these from a normal distribution {\cal N}(0, 0.1).

normals :: [Double]
normals =  fst $ runState (replicateM n (sampleRVar (normal 0 0.1)))
                          (mkStdGen seed)

And now we can run the Metropolis algorithm and, for example, find the mean of the posterior.

accepteds = drop (n `div` 10) $
            scanl oneStep 0.5 (zip normals (acceptOrRejects seed))

mean = total / count
         where
           (total, count) = foldl' f (0.0, 0) accepteds
           f (s, c) x = (s+x, c+1)
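
As a sanity check, note that the code uses the likelihood alone, which amounts to a uniform prior on [0, 1], so the posterior is available in closed form. With z = 9 successes in n = 14 trials

\displaystyle   \theta \,|\, \boldsymbol{x} \sim \textrm{Beta}(z + 1, n - z + 1) = \textrm{Beta}(10, 6), \qquad \mathbb{E}[\theta \,|\, \boldsymbol{x}] = \frac{10}{16} = 0.625

so we expect the computed mean to be close to 0.625.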

Functional R

We can now re-write the example in R in a more functional way.

## An arbitrary number of jumps to try in Metropolis
## algorithm.
trajLength = 11112

## The data represented as a vector.
myData = c( 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 )

## An arbitrary seed presumably used in generating random
## values from the various distributions.
set.seed(47405)

## We need some proposed jumps; following the example we
## generate these from a normal distribution.
normals = rep (0, trajLength)
for (j in 1:trajLength) {
  normals[j] = rnorm( 1, mean = 0, sd = 0.1 )
}

## We generate proposed jumps which are either one step up
## or one step down.
proposedJumps = rep (0, trajLength)
for (j in 1:trajLength) {
  proposedJumps[j] = if ( runif( 1 ) < 0.5 ) -1 else 1
}

## Samples from the uniform distribution on [0, 1], used to
## decide whether to accept or reject a proposed jump.
acceptOrRejects = rep (0, trajLength)
for (j in 1:trajLength) {
  acceptOrRejects[j] = runif( 1 )
}

## The likelihood of drawing a sequence with z ones out of n
## from a Bernoulli distribution with parameter theta.
likelihood = function( z, n, theta ) {
  x = theta^z * (1 - theta)^(n - z)
  x[ theta > 1 | theta < 0 ] = 0
  return ( x )
}

prior = function( position, xs ) {
  n = length( xs )
  z = sum ( xs == 1 )
  return ( likelihood ( z, n, position ) )
}

## We define a function which defines one step of the
## Metropolis algorithm.
oneStep = function ( currPosition, propJumpAorR ) {
  proposedJump = propJumpAorR [1]
  acceptOrReject = propJumpAorR [2]
  probAccept = min( 1,
                    prior( currPosition + proposedJump, myData )
                    / prior( currPosition, myData ) )
  if ( acceptOrReject < probAccept ) {
    trajectory = currPosition + proposedJump
  } else {
    trajectory = currPosition
  }
  return ( trajectory )
}

## Pair together the proposed jumps and the probability
## whether a given proposal will be accepted or rejected.
nsAorRs <- list ()
for (i in 1:trajLength)
  nsAorRs[[i]] <- c(normals[i], acceptOrRejects[i])

## Fold (well really scanl) over the pairs returning all the
## intermediate results.
trajectory = Reduce( function(a, b) oneStep (a, b),
                     nsAorRs, accumulate = T, init = 0.5 )

## Drop the values for the burn in period.
burnIn = ceiling( .1 * trajLength )
acceptedTraj = trajectory[burnIn:trajLength]

## Finally return the mean of the posterior.
result = mean ( acceptedTraj )