# Introduction

Suppose we have a vector of weights which sum to 1.0 and we wish to draw n samples at random according to these weights. There is a well-known trick in Matlab / Octave: draw from a uniform distribution and use *histc* to find which bin of the cumulative sum each draw falls into.

```
num_particles = 2*10^7;
likelihood = zeros(num_particles,1);
likelihood(:,1) = 1/num_particles;
% Bin each uniform draw against the cumulative sum of the weights.
[~,index] = histc(rand(num_particles,1),[0;cumsum(likelihood/sum(likelihood))]);
s = sum(index);
```

Timing this with *tic* and *toc* gives

`Elapsed time is 10.7763 seconds.`

# Haskell

I could find no equivalent function in Haskell, nor could I easily find a binary search function.

```
> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans #-}
```

```
> {-# LANGUAGE BangPatterns #-}
```

```
> import System.Random.MWC
> import qualified Data.Vector.Unboxed as V
> import Control.Monad.ST
```

```
> import qualified Data.Vector.Algorithms.Search as S
```

```
> import Data.Bits
```

```
> n :: Int
> n = 2*10^7
```

Let’s create some random data. For a change, let’s use mwc-random rather than random-fu.

```
> vs :: V.Vector Double
> vs = runST (create >>= (asGenST $ \gen -> uniformVector gen n))
```

Again, I could find no equivalent of *cumsum*, but *scanl* lets us write our own. The result has a leading zero and length n+1, just like the [0;cumsum(...)] vector on the Matlab side, which is why V.tail appears below.

```
> weightsV, cumSumWeightsV :: V.Vector Double
> weightsV = V.replicate n (recip $ fromIntegral n)
> cumSumWeightsV = V.scanl (+) 0 weightsV
```
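To see the shape *scanl* produces, here is a tiny example of my own using plain lists rather than unboxed vectors:

```haskell
main :: IO ()
main = do
  -- scanl keeps every intermediate sum, starting from the initial 0.
  print (scanl (+) 0 [0.2, 0.3, 0.5 :: Double])  -- [0.0,0.2,0.5,1.0]
```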

Binary search on a sorted vector is straightforward, and a cumulative sum of non-negative weights is sorted by construction.

```
> binarySearch :: (V.Unbox a, Ord a) =>
>                 V.Vector a -> a -> Int
> binarySearch vec x = loop 0 (V.length vec - 1)
>   where
>     loop !l !u
>       | u <= l    = l
>       | otherwise = let e = vec V.! k
>                     in if x <= e then loop l k else loop (k+1) u
>       where k = l + (u - l) `shiftR` 1
```
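As a standalone sanity check (the values are my own, and the definition is repeated so the snippet compiles on its own), the search returns the index of the first element of the cumulative sum that is greater than or equal to the probe:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as V
import Data.Bits (shiftR)

binarySearch :: (V.Unbox a, Ord a) => V.Vector a -> a -> Int
binarySearch vec x = loop 0 (V.length vec - 1)
  where
    loop !l !u
      | u <= l    = l
      | otherwise = let e = vec V.! k
                    in if x <= e then loop l k else loop (k+1) u
      where k = l + (u - l) `shiftR` 1

main :: IO ()
main = do
  let cs = V.fromList [0.25, 0.5, 0.75, 1.0 :: Double]
  -- First element >= 0.6 is at index 2; first element >= 0.1 is at index 0.
  print (binarySearch cs 0.6)  -- 2
  print (binarySearch cs 0.1)  -- 0
```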

```
> indices :: V.Vector Double -> V.Vector Double -> V.Vector Int
> indices bs xs = V.map (binarySearch bs) xs
```

To see how well this performs, let’s sum the indices (of course, we wouldn’t do this in practice) as we did for the Matlab implementation.

```
> js :: V.Vector Int
> js = indices (V.tail cumSumWeightsV) vs
```

```
> main :: IO ()
> main = do
>   print $ V.foldl' (+) 0 js
```

Using *+RTS -s* we get

`Total time 10.80s ( 11.06s elapsed)`

which is almost the same as the Matlab version.

I did eventually find a binary search function in vector-algorithms and since one should not re-invent the wheel, let us try using it.

```
> indices' :: (V.Unbox a, Ord a) => V.Vector a -> V.Vector a -> V.Vector Int
> indices' sv x = runST $ do
>   st <- V.unsafeThaw (V.tail sv)
>   V.mapM (S.binarySearch st) x
```

```
> main' :: IO ()
> main' = do
>   print $ V.foldl' (+) 0 $ indices' cumSumWeightsV vs
```

Again using *+RTS -s* we get

`Total time 11.34s ( 11.73s elapsed)`

So the library version seems very slightly slower.

Just a minor nit. Your implementation of binary search has the standard error in it. Admittedly, with 64-bit Ints, it’s extremely unlikely that you would run into it. (It’s quite feasible for 32-bit Ints.) The issue is in the midpoint calculation: u+l could overflow, e.g. 2^63+2^63 = 2^64 = 0. The standard solution is to calculate l + (u-l) `shiftR` 1.

Excellent point – now fixed.
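As a small standalone illustration of the commenter's point (my own example, not from the comment), the naive midpoint wraps around for indices near maxBound, while the subtraction form cannot overflow when l <= u:

```haskell
import Data.Bits (shiftR)

main :: IO ()
main = do
  let l = maxBound - 1 :: Int
      u = maxBound :: Int
  -- Naive midpoint: l + u wraps around, giving a negative "midpoint".
  print ((l + u) `div` 2)
  -- Safe midpoint: (u - l) cannot overflow for valid indices l <= u.
  print (l + (u - l) `shiftR` 1)  -- maxBound - 1
```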

Oh, and I suspect the difference in performance is due to your definition of binarySearch being local to the module which allows GHC to do things that it can only sometimes do with non-local functions. You could try copying the vector-algorithms code into your module and see if that changes the performance. (It’s also possible that the vector-algorithms implementation is more general. I haven’t looked at it.)

The vector-algorithms implementation is more general.

You could also use the alias method (http://www.keithschwarz.com/darts-dice-coins). Searching for [“alias method” haskell] turns up some implementations.

Looks very interesting – I just tried the Haskell implementation given here: http://stackoverflow.com/questions/8785577/generating-random-integers-with-given-probabilities but got “Total time 55.59s” and, worse, “6087 MB total memory in use”. Perhaps I am not testing it correctly.

I expect the comparison will depend on the number of samples generated. If n is the number of weights and m the number of samples, then the complexities for initialization+generation are: binary search O(n)+O(m log(n)), alias method O(n)+O(m). Depending on n and the constants, it may take a large m before the alias method becomes faster.

Excellent point, although in my test I had n = m. I had a quick conversation with Lennart on Stack Overflow and we both think the performance difference is because one uses System.Random and the other uses System.Random.MWC. Sadly I don’t have time to experiment at the moment, although I would love something that has better performance.
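For anyone curious, the O(n) table construction behind the alias method discussed above can be sketched as follows. This is my own rough sketch of Vose's variant, not the implementation linked in the comments, and it is not tuned for performance (plain lists, and it assumes the weights sum to 1):

```haskell
import Data.List (partition)

-- Build the probability and alias tables.  The worklist pass is O(n);
-- the list indexing used here is purely for clarity.
buildAlias :: [Double] -> ([Double], [Int])
buildAlias ws = ( [ fst (entry k) | k <- [0 .. n - 1] ]
                , [ snd (entry k) | k <- [0 .. n - 1] ] )
  where
    n = length ws
    scaled = zip [0 ..] (map (* fromIntegral n) ws)
    (smalls, larges) = partition ((< 1) . snd) scaled

    -- Pair each under-full bucket with a donor over-full bucket.
    go ((i, p) : ss) ((j, q) : ls) acc
      | q' < 1    = go ((j, q') : ss) ls acc'
      | otherwise = go ss ((j, q') : ls) acc'
      where
        q'   = q - (1 - p)
        acc' = (i, (p, j)) : acc
    -- Leftovers (one list runs out first, numerically) keep all their mass.
    go ss ls acc = [ (k, (1, k)) | (k, _) <- ss ++ ls ] ++ acc

    table = go smalls larges []
    entry k = maybe (error "unreachable") id (lookup k table)

-- Sample an index from two independent uniforms in [0,1).
sampleAlias :: ([Double], [Int]) -> Double -> Double -> Int
sampleAlias (prob, alias) u1 u2 =
  let i = floor (u1 * fromIntegral (length prob))
  in if u2 < prob !! i then i else alias !! i

main :: IO ()
main = do
  let ws = [0.5, 0.3, 0.2]
      n  = length ws
      (prob, alias) = buildAlias ws
      -- Reconstruct each weight from the tables as a sanity check.
      implied k = ( prob !! k
                  + sum [ 1 - prob !! i | i <- [0 .. n - 1], alias !! i == k ] )
                  / fromIntegral n
  print (and [ abs (implied k - w) < 1e-9 | (k, w) <- zip [0 ..] ws ])  -- True
```

Generation is then O(1) per sample: one uniform picks a bucket, a second decides between the bucket and its alias, which is where the O(n)+O(m) complexity quoted above comes from.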

Hi,

FWIW I think I have implemented this in my probable package (http://hackage.haskell.org/package/probable) in the finite distribution part. This wasn’t my idea; I simply borrowed the trick from Eric Kidd’s blog posts. See ‘liftF’ and ‘pick’ in this article: http://www.randomhacks.net/2007/02/21/randomly-sampled-distributions/

I also tried to make probable’s abstraction cost as little as possible, so that it stays pretty much on par with ‘mwc-random’ for many common and useful cases.

Cool! I learnt about it from some colleagues who use Matlab. Maybe it is part of the folklore, although it is probably also in Luc Devroye’s book: http://luc.devroye.org/rnbookindex.html. I wish I had more time to play with this stuff.