Haskell Vectors and Sampling from a Categorical Distribution


Suppose we have a vector of weights which sum to 1.0 and we wish to sample n samples randomly according to these weights. There is a well known trick in Matlab / Octave using sampling from a uniform distribution.

num_particles = 2*10^7
likelihood = zeros(num_particles,1);
likelihood(:,1) = 1/num_particles;
[_,index] = histc(rand(num_particles,1),[0;cumsum(likelihood/sum(likelihood))]);
s = sum(index);

Using tic and toc this produces an answer with

Elapsed time is 10.7763 seconds.


I could find no equivalent function in Haskell nor could I easily find a binary search function.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE BangPatterns                 #-}
> import System.Random.MWC
> import qualified Data.Vector.Unboxed as V
> import Control.Monad.ST
> import qualified Data.Vector.Algorithms.Search as S
> import Data.Bits
> n :: Int
> n = 2*10^7

Let’s create some random data. For a change let’s use mwc-random rather than random-fu.

> vs  :: V.Vector Double
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))

Again, I could find no equivalent of cumsum but we can write our own.

> weightsV, cumSumWeightsV :: V.Vector Double
> weightsV = V.replicate n (recip $ fromIntegral n)
> cumSumWeightsV = V.scanl (+) 0 weightsV

Binary search on a sorted vector is straightforward and a cumulative sum ensures that the vector is sorted.

> binarySearch :: (V.Unbox a, Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs

To see how well this performs, let’s sum the indices (of course, we wouldn’t do this in practice) as we did for the Matlab implementation.

> js :: V.Vector Int
> js = indices (V.tail cumSumWeightsV) vs
> main :: IO ()
> main = do
>   print $ V.foldl' (+) 0 js

Using +RTS -s we get

Total   time   10.80s  ( 11.06s elapsed)

which is almost the same as the Matlab version.

I did eventually find a binary search function in vector-algorithms and since one should not re-invent the wheel, let us try using it.

> indices' :: (V.Unbox a, Ord a) => V.Vector a -> V.Vector a -> V.Vector Int
> indices' sv x = runST $ do
>   st <- V.unsafeThaw (V.tail sv)
>   V.mapM (S.binarySearch st) x
> main' :: IO ()
> main' = do
>   print $  V.foldl' (+) 0 $ indices' cumSumWeightsV vs

Again using +RTS -s we get

Total   time   11.34s  ( 11.73s elapsed)

So the library version seems very slightly slower.

10 thoughts on “Haskell Vectors and Sampling from a Categorical Distribution

  1. Just a minor nit. Your implementation of binary search has the standard error in it. Admittedly, with 64-bit Ints, it’s extremely unlikely that you would run into it. (It’s quite feasible for 32-bit Ints.) The issue is in the midpoint calculation: u+l could overflow, e.g. 2^63+2^63 = 2^64 = 0. The standard solution is to calculate l + (u-l) `shiftR` 1.

  2. Oh, and I suspect the difference in performance is due to your definition of binarySearch being local to the module which allows GHC to do things that it can only sometimes do with non-local functions. You could try copying the vector-algorithms code into your module and see if that changes the performance. (It’s also possible that the vector-algorithms implementation is more general. I haven’t looked at it.)

  3. I expect the comparison will depend on the number of samples generated. If n is the number of weights and m the number of samples, then the complexities for initialization+generation are: binary search O(n)+O(m log(n)), alias method O(n)+O(m). Depending on n and the constants, it may take a large m before the alias method becomes faster.

  4. Excellent point although in my test I had n = m. I had a quick conversation with Lennart on Stackoverflow and we both think the performance difference is because one uses System.Random and the other uses System.Random.MWC. Sadly I don’t have time to experiment at the moment although I would love something that has better performance.

  5. Hi,

    FWIW I think I have implemented this in my probable package (http://hackage.haskell.org/package/probable) in the finite distribution part. This wasn’t my idea, I simply borrowed the trick from Eric Kidd’s blog posts. See ‘liftF’ and ‘pick’ in this article: http://www.randomhacks.net/2007/02/21/randomly-sampled-distributions/

    I also tried to make probable’s abstraction cost the least possible, so that I stay pretty much on par with ‘mwc-random’ for many common and useful cases.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s