# Introduction

Suppose we have a vector of weights that sum to 1.0 and we wish to draw n samples at random according to those weights, that is, to sample from a categorical distribution. There is a well-known trick in Matlab / Octave that does this by sampling from a uniform distribution.

``````num_particles = 2*10^7;
likelihood = zeros(num_particles,1);
likelihood(:,1) = 1/num_particles;
[~,index] = histc(rand(num_particles,1),[0;cumsum(likelihood/sum(likelihood))]);
s = sum(index);``````

Timed with tic and toc, this produces an answer with

``Elapsed time is 10.7763 seconds.``

I could find no equivalent function in Haskell, nor could I easily find a binary search function.

``````> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
``````
``````> {-# LANGUAGE BangPatterns                 #-}
``````
``````> import Control.Monad.ST
> import System.Random.MWC
> import qualified Data.Vector.Unboxed as V
``````
``````> import qualified Data.Vector.Algorithms.Search as S
``````
``````> import Data.Bits
``````
``````> n :: Int
> n = 2*10^7
``````

Let’s create some random data. For a change let’s use mwc-random rather than random-fu.

``````> vs :: V.Vector Double
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
``````

Again, I could find no equivalent of cumsum, but we can write our own.

``````> weightsV, cumSumWeightsV :: V.Vector Double
> weightsV = V.replicate n (recip $ fromIntegral n)
> cumSumWeightsV = V.scanl (+) 0 weightsV
``````
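As a quick sanity check, here is the same cumulative sum on a toy input (a standalone snippet using plain lists, not part of this post's module; Data.List.scanl has the same semantics as V.scanl):

```haskell
-- Four equal weights of 0.25 give the bin boundaries 0, 0.25, 0.5, 0.75, 1.0.
main :: IO ()
main = print (scanl (+) 0 (replicate 4 (0.25 :: Double)))
-- prints [0.0,0.25,0.5,0.75,1.0]
```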

Binary search on a sorted vector is straightforward, and since the weights are non-negative, the cumulative sum is guaranteed to be sorted.

``````> binarySearch :: (V.Unbox a, Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
``````
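To see it at work, here is a standalone snippet (the definition is repeated so the example compiles on its own; the boundary vector is made up for illustration):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits (shiftR)
import qualified Data.Vector.Unboxed as V

-- Identical to the binarySearch above.
binarySearch :: (V.Unbox a, Ord a) => V.Vector a -> a -> Int
binarySearch vec x = loop 0 (V.length vec - 1)
  where
    loop !l !u
      | u <= l    = l
      | otherwise = let e = vec V.! k in if x <= e then loop l k else loop (k+1) u
      where k = l + (u - l) `shiftR` 1

main :: IO ()
main = do
  -- Cumulative weights for four equally likely outcomes.
  let boundaries = V.fromList [0.25, 0.5, 0.75, 1.0 :: Double]
  -- 0.1 lands in bin 0, 0.6 in bin 2 and 0.99 in bin 3.
  print (map (binarySearch boundaries) [0.1, 0.6, 0.99])
  -- prints [0,2,3]
```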
``````> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs
``````

To see how well this performs, let’s sum the indices (of course, we wouldn’t do this in practice) as we did for the Matlab implementation.

``````> js :: V.Vector Int
> js = indices (V.tail cumSumWeightsV) vs
``````
``````> main :: IO ()
> main = do
>   print $ V.foldl' (+) 0 js
``````

Using +RTS -s we get

``Total   time   10.80s  ( 11.06s elapsed)``

which is almost the same as the Matlab version.

I did eventually find a binary search function in vector-algorithms and, since one should not re-invent the wheel, let us try using it.

``````> indices' :: (V.Unbox a, Ord a) => V.Vector a -> V.Vector a -> V.Vector Int
> indices' sv xs = runST $ do
>   st <- V.unsafeThaw (V.tail sv)
>   V.mapM (S.binarySearch st) xs
``````
``````> main' :: IO ()
> main' = do
>   print $ V.foldl' (+) 0 $ indices' cumSumWeightsV vs
``````

Again using +RTS -s we get

``Total   time   11.34s  ( 11.73s elapsed)``

So the library version seems very slightly slower.

## 10 thoughts on “Haskell Vectors and Sampling from a Categorical Distribution”

1. Derek Elkins

Just a minor nit. Your implementation of binary search has the standard error in it. Admittedly, with 64-bit Ints, it’s extremely unlikely that you would run into it. (It’s quite feasible for 32-bit Ints.) The issue is in the midpoint calculation: u+l could overflow, e.g. 2^63+2^63 = 2^64 = 0. The standard solution is to calculate l + (u-l) `shiftR` 1.
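Derek's overflow can be seen concretely at 32 bits with a standalone toy (not from the discussion itself; the values are chosen purely to trigger the wraparound):

```haskell
import Data.Bits (shiftR)
import Data.Int (Int32)

main :: IO ()
main = do
  let l = 2 ^ 30 :: Int32
      u = maxBound :: Int32          -- 2^31 - 1
  -- Naive midpoint: l + u wraps around and the "midpoint" goes negative.
  print ((l + u) `shiftR` 1)         -- prints -536870913
  -- Safe midpoint: u - l cannot overflow when l <= u.
  print (l + (u - l) `shiftR` 1)     -- prints 1610612735
```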

• Dominic Steinitz

Excellent point – now fixed.

2. Derek Elkins

Oh, and I suspect the difference in performance is due to your definition of binarySearch being local to the module which allows GHC to do things that it can only sometimes do with non-local functions. You could try copying the vector-algorithms code into your module and see if that changes the performance. (It’s also possible that the vector-algorithms implementation is more general. I haven’t looked at it.)

3. Dominic Steinitz

The vector-algorithms implementation is more general.

4. Jan Van lent

I expect the comparison will depend on the number of samples generated. If n is the number of weights and m the number of samples, then the complexities for initialization+generation are: binary search O(n)+O(m log(n)), alias method O(n)+O(m). Depending on n and the constants, it may take a large m before the alias method becomes faster.
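For reference, a minimal sketch of the alias method Jan mentions (Vose's variant; buildAlias and sampleAlias are illustrative names of my own, and a production version would treat floating-point rounding near 1 more carefully):

```haskell
import Data.List (partition)
import qualified Data.Vector.Unboxed as V

-- O(n) table construction: each entry i keeps a threshold prob[i] and a
-- fallback alias[i]; entries never touched by the loop keep the defaults
-- (probability 1, alias pointing at themselves).
buildAlias :: V.Vector Double -> (V.Vector Double, V.Vector Int)
buildAlias ws = (V.replicate n 1 V.// probUpds, V.enumFromN 0 n V.// aliasUpds)
  where
    n      = V.length ws
    scaled = [ (i, w * fromIntegral n) | (i, w) <- zip [0 ..] (V.toList ws) ]
    (smalls, larges)      = partition ((< 1) . snd) scaled
    (probUpds, aliasUpds) = go smalls larges [] []
    go ((si, sp) : ss) ((li, lp) : ls) ps as =
      let lp' = lp - (1 - sp)          -- the large donor tops up the small bin
          ps' = (si, sp) : ps
          as' = (si, li) : as
      in if lp' < 1 then go ((li, lp') : ss) ls ps' as'
                    else go ss ((li, lp') : ls) ps' as'
    go _ _ ps as = (ps, as)            -- leftovers keep the defaults

-- O(1) per draw, given two independent uniforms in [0,1).
sampleAlias :: (V.Vector Double, V.Vector Int) -> Double -> Double -> Int
sampleAlias (probs, aliases) u1 u2 =
  let i = min (V.length probs - 1) (floor (u1 * fromIntegral (V.length probs)))
  in if u2 < probs V.! i then i else aliases V.! i

main :: IO ()
main = do
  let tables = buildAlias (V.fromList [0.1, 0.2, 0.3, 0.4])
  print (map (uncurry (sampleAlias tables)) [(0.0, 0.1), (0.9, 0.5)])
  -- prints [0,3]
```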

5. Dominic Steinitz

Excellent point, although in my test I had n = m. I had a quick conversation with Lennart on Stack Overflow and we both think the performance difference is because one uses System.Random and the other uses System.Random.MWC. Sadly I don’t have time to experiment at the moment, although I would love something that has better performance.

6. Hi,

FWIW I think I have implemented this in my probable package (http://hackage.haskell.org/package/probable) in the finite distribution part. This wasn’t my idea; I simply borrowed the trick from Eric Kidd’s blog posts. See ‘liftF’ and ‘pick’ in this article: http://www.randomhacks.net/2007/02/21/randomly-sampled-distributions/

I also tried to make probable’s abstraction cost as little as possible, so that it stays pretty much on par with ‘mwc-random’ for many common and useful cases.

7. Dominic Steinitz

Cool! I learnt about it from some colleagues who use Matlab. Maybe it is part of the folklore, although it is probably also in Luc Devroye’s book: http://luc.devroye.org/rnbookindex.html. I wish I had more time to play with this stuff.