Logistic Regression and Automated Differentiation

Introduction

Having shown how to use automated differentiation to estimate parameters in the case of linear regression, let us now turn our attention to the problem of classification. For example, we might have some data about people’s social networking activity, such as the volume of their Twitter interactions and their number of Twitter followers, together with a label representing a human judgement about which of two individuals is the more influential. We would like to predict, for a pair of individuals, the human judgement on who is more influential.

Logistic Regression

We define the probability of getting a particular value of the binary label:

\displaystyle   \begin{aligned}  {\mathbb P}(y = 1 \mid \boldsymbol{x}; \boldsymbol{\theta}) &= h_{\boldsymbol{\theta}}(\boldsymbol{x}) \\  {\mathbb P}(y = 0 \mid \boldsymbol{x}; \boldsymbol{\theta}) &= 1 - h_{\boldsymbol{\theta}}(\boldsymbol{x})  \end{aligned}

where \boldsymbol{x} and \boldsymbol{\theta} are column vectors of size m

\displaystyle   h_{\boldsymbol{\theta}}(\boldsymbol{x}) = g(\boldsymbol{\theta}^T\boldsymbol{x})

and g is a function such as the logistic function g(x) = 1 / (1 + e^{-x}) or \tanh.

We can re-write this as:

\displaystyle   p(y \mid \boldsymbol{x} ; \boldsymbol{\theta}) = (h_{\boldsymbol{\theta}}(\boldsymbol{x}))^y(1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}))^{1 - y}

We wish to find the value of \boldsymbol{\theta} that gives the maximum probability to the observations. We do this by maximising the likelihood. Assuming we have n observations the likelihood is:

\displaystyle   \begin{aligned}  \mathcal{L}(\boldsymbol{\theta}) &= \prod_{i=1}^n p(y^{(i)} \mid {\boldsymbol{x}}^{(i)} ; \boldsymbol{\theta}) \\            &= \prod_{i=1}^n (h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))^{y^{(i)}} (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))^{1 - y^{(i)}}  \end{aligned}

It is standard practice to maximise the log likelihood, which gives the same maximum since log is monotonic.

\displaystyle   \begin{aligned}  \lambda(\boldsymbol{\theta}) &= \log \mathcal{L}(\boldsymbol{\theta}) \\            &= \sum_{i=1}^n {y^{(i)}}\log h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) + (1 - y^{(i)})\log (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))  \end{aligned}

In order to maximise the log likelihood we again use the method of steepest ascent: if \boldsymbol{\theta}^i is a guess for the parameters of the model then we can improve the guess by stepping a small distance in the direction of greatest change.

\displaystyle   \boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} + \gamma \nabla\lambda(\boldsymbol{\theta}^{i})

\gamma is some constant known in machine learning as the learning rate. It must be chosen to be large enough to ensure convergence within a reasonable number of steps but not so large that the algorithm fails to converge.
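For reference, the gradient that automated differentiation will compute for us has a well-known closed form: using g'(z) = g(z)(1 - g(z)) for the logistic function,

\displaystyle   \nabla\lambda(\boldsymbol{\theta}) = \sum_{i=1}^n \big(y^{(i)} - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})\big)\boldsymbol{x}^{(i)}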

When the number of observations is large, evaluating the full gradient at each step can be expensive; as a cheaper alternative we can use stochastic gradient ascent. Instead of taking the gradient with respect to all the observations at once, we take the gradient with respect to each observation in our data set in turn. Of course, if our data set is small we may have to pass over it several times to achieve convergence.
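A single stochastic update can be sketched as follows. To keep the snippet self-contained we use the closed-form gradient (y - g(\boldsymbol{\theta}^T\boldsymbol{x}))\boldsymbol{x} rather than automated differentiation; the names sigmoid and sgdStep are illustrative, not part of this post's code.

```haskell
-- The logistic function at the Double type.
sigmoid :: Double -> Double
sigmoid z = 1 / (1 + exp (negate z))

-- One stochastic ascent step driven by a single observation (y, x):
-- theta <- theta + gamma * (y - g(theta . x)) * x
sgdStep :: Double -> (Double, [Double]) -> [Double] -> [Double]
sgdStep gamma (y, x) theta = zipWith (+) theta (map (* (gamma * err)) x)
  where
    err = y - sigmoid (sum (zipWith (*) theta x))
```

Starting from \boldsymbol{\theta} = (0, 0) with observation y = 1, \boldsymbol{x} = (1, 1) and \gamma = 0.1, the error is 1 - g(0) = 0.5 and the step moves each component of \boldsymbol{\theta} by 0.05.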

When the observations / training data are linearly separable, the magnitude of the parameters can grow without bound as the (parametrized) logistic function then tends to the Heaviside / step function. Moreover, there can be more than one separating hyperplane in this circumstance. To circumvent these infelicities, we instead maximise a penalized log likelihood function:

\displaystyle   \sum_{i=1}^n {y^{(i)}}\log h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) + (1 - y^{(i)})\log (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})) - \frac{\delta}{2}\|\boldsymbol{\theta}\|^2
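Differentiating the penalized log likelihood gives the gradient we ascend:

\displaystyle   \sum_{i=1}^n \big(y^{(i)} - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})\big)\boldsymbol{x}^{(i)} - \delta\boldsymbol{\theta}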

See Bishop and Mitchell for further details.

Implementation

Some pragmas to warn us about potentially dangerous situations.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults  #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
> 
> {-# LANGUAGE TupleSections #-}
> 
> module Logistic ( betas
>                 , main
>                 , a
>                 , b
>                 , nSamples
>                 ) where

Modules from the automatic differentiation library, together with some standard modules.

> import Numeric.AD
> import Numeric.AD.Types
> import qualified Data.Vector as V
> import Control.Monad
> import Control.Monad.State
> import Data.List
> import Text.Printf

Some modules from a random number generator library as we will want to generate some test data.

> import System.Random
> import Data.Random ()
> import Data.Random.Distribution.Beta
> import Data.RVar

Our model: the probability that y has the label 1 given the observation \boldsymbol{x}.

> logit :: Floating a =>
>          a -> a
> logit x = 1 / (1 + exp (negate x))
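As a quick sanity check (restated here at the Double type so the snippet stands alone), the logistic function maps 0 to 1/2 and satisfies the symmetry 1 - g(z) = g(-z), an identity often used to evaluate \log(1 - g(z)) stably:

```haskell
-- The logistic function, monomorphised to Double for illustration.
logitD :: Double -> Double
logitD x = 1 / (1 + exp (negate x))
```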

For each observation, the log likelihood:

> logLikelihood :: Floating a => V.Vector a -> a -> V.Vector a -> a
> logLikelihood theta y x = y * log (logit z) +
>                           (1 - y) * log (1 - logit z)
>   where
>     z = V.sum $ V.zipWith (*) theta x

The total log likelihood over all the observations, with the penalty term included and averaged over the number of observations:

> totalLogLikelihood :: Floating a =>
>                       V.Vector a ->
>                       V.Vector a ->
>                       V.Vector (V.Vector a) ->
>                       a
> totalLogLikelihood theta y x = (a - delta * b) / l
>   where
>     l = fromIntegral $ V.length y
>     a = V.sum $ V.zipWith (logLikelihood theta) y x
>     b = (/2) $ V.sum $ V.map (^2) theta
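A worked example may help. Restating the per-observation log likelihood with plain lists (so the snippet stands alone, without Data.Vector), an uninformative parameter vector assigns probability 1/2 to the label, giving a log likelihood of -\log 2 per observation. The names below are illustrative only.

```haskell
logitL :: Double -> Double
logitL x = 1 / (1 + exp (negate x))

-- Per-observation log likelihood, as in logLikelihood but over lists.
logLik :: [Double] -> Double -> [Double] -> Double
logLik theta y x = y * log (logitL z) + (1 - y) * log (1 - logitL z)
  where
    z = sum (zipWith (*) theta x)
```

For example, logLik [0, 1] 1 [1, 0] is \log(1/2) \approx -0.693, while a confidently correct prediction such as logLik [0, 5] 1 [1, 2] is close to zero.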

As before, we can implement steepest ascent using the grad function.

> delTotalLogLikelihood :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalLogLikelihood y x = grad f
>   where
>     f theta = totalLogLikelihood theta
>                                  (V.map auto y)
>                                  (V.map (V.map auto) x)
> 
> stepOnce :: Double ->
>             V.Vector Double ->
>             V.Vector (V.Vector Double) ->
>             V.Vector Double ->
>             V.Vector Double
> stepOnce gamma y x theta =
>   V.zipWith (+) theta (V.map (* gamma) $ del theta)
>   where
>     del = delTotalLogLikelihood y x
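To see the iteration in action without pulling in the AD machinery, here is a toy batch gradient ascent on four hand-made observations, using the closed-form gradient \sum_i (y^{(i)} - h(\boldsymbol{x}^{(i)}))\boldsymbol{x}^{(i)} and no penalty term; all names here are illustrative.

```haskell
sigmoid2 :: Double -> Double
sigmoid2 z = 1 / (1 + exp (negate z))

-- One batch ascent step: theta <- theta + gamma * sum_i (y_i - h(x_i)) x_i.
ascend :: Double -> [(Double, [Double])] -> [Double] -> [Double]
ascend gamma obs theta = zipWith (+) theta (map (* gamma) gradient)
  where
    gradient = foldr (zipWith (+)) (map (const 0) theta)
                 [ map (* (y - sigmoid2 (dot theta x))) x | (y, x) <- obs ]
    dot u v  = sum (zipWith (*) u v)

-- Four observations with a bias input of 1.0, separable at roughly x = 0.5.
toyObs :: [(Double, [Double])]
toyObs = [(0, [1, 0.2]), (0, [1, 0.3]), (1, [1, 0.7]), (1, [1, 0.8])]
```

After a couple of thousand iterations starting from (0, 0), the fitted parameters classify all four points correctly.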

Or even easier just use the library function gradientAscent!

> estimates :: (Floating a, Ord a) =>
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              V.Vector a ->
>              [V.Vector a]
> estimates y x = gradientAscent $
>                 \theta -> totalLogLikelihood theta
>                                              (V.map auto y)
>                                              (V.map (V.map auto) x)

Let’s try it out. First we need to generate some data. Rather arbitrarily let us create some populations from the beta distribution.

> betas :: Int -> Double -> Double -> [Double]
> betas n a b =
>   fst $ runState (replicateM n (sampleRVar (beta a b))) (mkStdGen seed)
>     where
>       seed = 0

We can plot the populations we wish to distinguish by sampling.

> a, b :: Double
> a          = 15
> b          = 6
> nSamples :: Int
> nSamples   = 100000
> 
> sample0, sample1 :: [Double]
> sample0 = betas nSamples a b
> sample1 = betas nSamples b a

Note that in this case we could come up with a classification rule simply by inspecting histograms of the two samples. Furthermore, the populations overlap, which means we will inevitably mis-classify some observations.

> mixSamples :: [Double] -> [Double] -> [(Double, Double)]
> mixSamples xs ys = unfoldr g ((map (0,) xs), (map (1,) ys))
>   where
>     g ([], [])         = Nothing
>     g ([],  _)         = Nothing
>     g ( _, [])         = Nothing
>     g ((x:xs), (y:ys)) = Just $ (x, (y:ys, xs))

We take the first 100 observations from the mixed samples.

> createSample :: V.Vector (Double, Double)
> createSample = V.fromList $ take 100 $ mixSamples sample1 sample0
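The interleaving can be seen on a small example; it is restated here without the TupleSections extension so the snippet stands alone. Note that g stops as soon as either side is exhausted, so with equal-length inputs the final element of the second list is dropped. This is harmless here, as we only take 100 of the 200,000 interleaved pairs.

```haskell
import Data.List (unfoldr)

-- Interleave two populations, labelling the first 0 and the second 1.
mixSamples' :: [Double] -> [Double] -> [(Double, Double)]
mixSamples' xs ys = unfoldr g (map (\v -> (0, v)) xs, map (\v -> (1, v)) ys)
  where
    g ([], _)        = Nothing
    g (_, [])        = Nothing
    g (x:xs', y:ys') = Just (x, (y:ys', xs'))
```

For example, mixSamples' [0.2, 0.4] [0.6, 0.8] yields [(0.0,0.2),(1.0,0.6),(0.0,0.4)].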

We create a model with one independent variable and thus two parameters.

> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 1.0]

We initialise our algorithm with arbitrary values.

> initTheta :: V.Vector Double
> initTheta = V.replicate (V.length actualTheta) 0.1

Set the learning rate, the strength of the penalty term and the number of iterations.

> gamma :: Double
> gamma = 0.1
> 
> delta :: Floating a => a
> delta = 1.0
> 
> nIters :: Int
> nIters = 4000

Now we can run our example. For the constant parameter of our model (known in machine learning as the bias) we ensure that the corresponding “independent variable” is always set to 1.0.

> vals :: V.Vector (Double, V.Vector Double)
> vals = V.map (\(y, x) -> (y, V.fromList [1.0, x])) $ createSample
> main :: IO ()
> main = do
>   let u = V.map fst vals
>       v = V.map snd vals
>       hs = iterate (stepOnce gamma u v) initTheta
>       theta =  head $ drop nIters hs
>       theta' = head $ drop 100 $ estimates u v initTheta
>   printf "Hand crafted descent: theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta V.! 0) (theta V.! 1)
>   printf "Library descent:      theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta' V.! 0) (theta' V.! 1)
>   let predProbs  = V.map (\x -> logit $ V.sum $ V.zipWith (*) theta x) v
>       mismatches = V.filter (> 0.5) $
>                    V.map abs $
>                    V.zipWith (-) actuals preds
>         where
>           actuals = V.map fst vals
>           preds   = V.map (\x -> fromIntegral $ fromEnum (x > 0.5))
>                           predProbs
>   let lActuals, lMisMatches :: Double
>       lActuals    = fromIntegral $ V.length vals
>       lMisMatches = fromIntegral $ V.length mismatches
>   printf "%5.2f%% correct\n" $
>          100.0 *  (lActuals - lMisMatches) / lActuals

And we get quite reasonable estimates:

ghci> main
  Hand crafted descent: theta_0 = -2.071, theta_1 = 4.318
  Library descent:      theta_0 = -2.070, theta_1 = 4.319
  97.00% correct
This entry was posted in Haskell, Machine Learning, Probability, Statistics.
