# Logistic Regression and Automated Differentiation

## Introduction

Having shown how to use automated differentiation to estimate parameters in the case of linear regression, let us now turn our attention to the problem of classification. For example, we might have some data about people’s social networking activity, such as volume of Twitter interactions and number of Twitter followers, together with a label which represents a human judgement about which of two individuals is the more influential. We would like to predict, for a pair of individuals, the human judgement on who is more influential.

## Logistic Regression

We define the probability of getting a particular value of the binary label:

\displaystyle \begin{aligned} {\mathbb P}(y = 1 \mid \boldsymbol{x}; \boldsymbol{\theta}) &= h_{\boldsymbol{\theta}}(\boldsymbol{x}) \\ {\mathbb P}(y = 0 \mid \boldsymbol{x}; \boldsymbol{\theta}) &= 1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}) \end{aligned}

where $\boldsymbol{x}$ and $\boldsymbol{\theta}$ are column vectors of size $m$,

$\displaystyle h_{\boldsymbol{\theta}}(\boldsymbol{x}) = g(\boldsymbol{\theta}^T\boldsymbol{x})$

and $g$ is a function such as the logistic function $g(x) = 1 / (1 + e^{-x})$ or $\tanh$.
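
Although with automated differentiation we never need it explicitly, it is worth noting the conveniently simple closed form for the derivative of the logistic function (a standard identity, stated here for orientation):

$\displaystyle g'(x) = g(x)(1 - g(x))$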

We can re-write this as:

$\displaystyle p(y \mid \boldsymbol{x} ; \boldsymbol{\theta}) = (h_{\boldsymbol{\theta}}(\boldsymbol{x}))^y(1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}))^{1 - y}$

We wish to find the value of $\boldsymbol{\theta}$ that gives the maximum probability to the observations. We do this by maximising the likelihood. Assuming we have $n$ observations the likelihood is:

\displaystyle \begin{aligned} \mathcal{L}(\boldsymbol{\theta}) &= \prod_{i=1}^n p(y^{(i)} \mid {\boldsymbol{x}}^{(i)} ; \boldsymbol{\theta}) \\ &= \prod_{i=1}^n (h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))^{y^{(i)}} (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}))^{1 - y^{(i)}} \end{aligned}

It is standard practice to maximise the log likelihood, which gives the same maximum since log is monotonic.

\displaystyle \begin{aligned} \lambda(\boldsymbol{\theta}) &= \log \mathcal{L}(\boldsymbol{\theta}) \\ &= \sum_{i=1}^n {y^{(i)}}\log h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) + (1 - y^{(i)})\log (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})) \end{aligned}
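
For the logistic choice of $g$, the gradient of the log likelihood has a well-known closed form (a standard result, included for orientation; the code below obtains the gradient by automated differentiation instead):

$\displaystyle \nabla\lambda(\boldsymbol{\theta}) = \sum_{i=1}^n \big(y^{(i)} - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})\big)\,\boldsymbol{x}^{(i)}$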

In order to maximise the log likelihood, we again use the method of steepest ascent: if $\boldsymbol{\theta}^i$ is a guess for the parameters of the model then we can improve the guess by stepping a small distance in the direction of greatest change.

$\displaystyle \boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} + \gamma \nabla\lambda(\boldsymbol{\theta}^{i})$

$\gamma$ is some constant known in machine learning as the learning rate. It must be chosen to be large enough to ensure convergence within a reasonable number of steps but not so large that the algorithm fails to converge.

When the number of observations is high, the cost of evaluating the gradient over the whole data set can be high; as a cheaper alternative we can use stochastic gradient ascent (the ascent analogue of stochastic gradient descent). Instead of taking the gradient with respect to all the observations at once, we take a step using the gradient with respect to each observation in turn. Of course if our data set is small we may have to pass over it several times to achieve convergence.
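
A single stochastic step can be sketched as follows. This is a minimal, self-contained illustration using plain lists and the hand-derived per-observation gradient $(y - h_{\boldsymbol{\theta}}(\boldsymbol{x}))\boldsymbol{x}$, rather than the vector types and automated differentiation used in the implementation below; the names are illustrative only.

```haskell
-- The logistic function, as in the model above.
logistic :: Double -> Double
logistic z = 1 / (1 + exp (negate z))

-- One stochastic ascent step using a single observation (y, x):
-- theta' = theta + gamma * (y - h_theta(x)) * x
stepStoch :: Double -> (Double, [Double]) -> [Double] -> [Double]
stepStoch gamma (y, x) theta = zipWith (+) theta (map (* scale) x)
  where
    h     = logistic (sum (zipWith (*) theta x))
    scale = gamma * (y - h)
```

Folding `stepStoch` over the observations then gives one pass of stochastic ascent.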

When the observations / training data are linearly separable, the magnitude of the parameters can grow without bound, as the (parametrized) logistic function then tends to the Heaviside / step function. Moreover, there can be more than one separating hyperplane in this circumstance. To circumvent these infelicities, we instead maximise a penalised log likelihood function:

$\displaystyle \sum_{i=1}^n {y^{(i)}}\log h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) + (1 - y^{(i)})\log (1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})) - \frac{\delta}{2}\|\boldsymbol{\theta}\|^2$
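
The effect of the penalty on the gradient is just an extra shrinkage term (stated for orientation only, since the code differentiates automatically):

$\displaystyle \nabla\left(\lambda(\boldsymbol{\theta}) - \frac{\delta}{2}\|\boldsymbol{\theta}\|^2\right) = \nabla\lambda(\boldsymbol{\theta}) - \delta\boldsymbol{\theta}$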

See Bishop and Mitchell for further details.

## Implementation

Some pragmas to warn us about potentially dangerous situations.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults  #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
>
> {-# LANGUAGE TupleSections #-}
>
> module Logistic ( betas
>                 , main
>                 , a
>                 , b
>                 , nSamples
>                 ) where


Modules from the automatic differentiation library.

> import Numeric.AD

> import qualified Data.Vector as V
> import Data.List
> import Text.Printf


Some modules from a random number generator library as we will want to generate some test data.

> import System.Random
> import Data.Random ()
> import Data.Random.Distribution.Beta
> import Data.RVar
>
> import Control.Monad
> import Control.Monad.State


Our model: the probability that $y$ has the label 1 given the observations $\boldsymbol{x}$.

> logit :: Floating a =>
>          a -> a
> logit x = 1 / (1 + exp (negate x))


For each observation, the log likelihood:

> logLikelihood :: Floating a => V.Vector a -> a -> V.Vector a -> a
> logLikelihood theta y x = y * log (logit z) +
>                           (1 - y) * log (1 - logit z)
>   where
>     z = V.sum $ V.zipWith (*) theta x


The total log likelihood, with the penalty term:

> totalLogLikelihood :: Floating a =>
>                       V.Vector a ->
>                       V.Vector a ->
>                       V.Vector (V.Vector a) ->
>                       a
> totalLogLikelihood theta y x = (a - delta * b) / l
>   where
>     l = fromIntegral $ V.length y
>     a = V.sum $ V.zipWith (logLikelihood theta) y x
>     b = (/2) $ V.sum $ V.map (^2) theta


As before we can implement steepest ascent using the grad function.

> delTotalLogLikelihood :: Floating a =>
>                          V.Vector a ->
>                          V.Vector (V.Vector a) ->
>                          V.Vector a ->
>                          V.Vector a
> delTotalLogLikelihood y x = grad f
>   where
>     f theta = totalLogLikelihood theta
>                                  (V.map auto y)
>                                  (V.map (V.map auto) x)
>
> stepOnce :: Double ->
>             V.Vector Double ->
>             V.Vector (V.Vector Double) ->
>             V.Vector Double ->
>             V.Vector Double
> stepOnce gamma y x theta =
>   V.zipWith (+) theta (V.map (* gamma) $ del theta)
>   where
>     del = delTotalLogLikelihood y x


Or, even easier, just use the library function gradientAscent!

> estimates :: (Floating a, Ord a) =>
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              V.Vector a ->
>              [V.Vector a]
> estimates y x = gradientAscent $
>                 \theta -> totalLogLikelihood theta
>                                              (V.map auto y)
>                                              (V.map (V.map auto) x)


Let’s try it out. First we need to generate some data. Rather arbitrarily, let us create some populations from the beta distribution.

> betas :: Int -> Double -> Double -> [Double]
> betas n a b =
>   fst $ runState (replicateM n (sampleRVar (beta a b))) (mkStdGen seed)
>     where
>       seed = 0


We can plot the populations we wish to distinguish by sampling.

> a, b :: Double
> a          = 15
> b          = 6
> nSamples :: Int
> nSamples   = 100000
>
> sample0, sample1 :: [Double]
> sample0 = betas nSamples a b
> sample1 = betas nSamples b a


Note that in this case we could come up with a classification rule just by inspecting histograms of the two samples. Furthermore, the populations overlap, which means we will inevitably mis-classify some observations.

> mixSamples :: [Double] -> [Double] -> [(Double, Double)]
> mixSamples xs ys = unfoldr g ((map (0,) xs), (map (1,) ys))
>   where
>     g ([], [])         = Nothing
>     g ([],  _)         = Nothing
>     g ( _, [])         = Nothing
>     g ((x:xs), (y:ys)) = Just $ (x, (y:ys, xs))

> createSample :: V.Vector (Double, Double)
> createSample = V.fromList $ take 100 $ mixSamples sample1 sample0


We create a model with one independent variable and thus two parameters.

> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 1.0]


We initialise our algorithm with arbitrary values.

> initTheta :: V.Vector Double
> initTheta = V.replicate (V.length actualTheta) 0.1


Set the learning rate, the strength of the penalty term and the number of iterations.

> gamma :: Double
> gamma = 0.1
>
> delta :: Floating a => a
> delta = 1.0
>
> nIters :: Int
> nIters = 4000


Now we can run our example. For the constant parameter of our model (known in machine learning as the bias) we ensure that the corresponding “independent variable” is always set to $1.0$.

> vals :: V.Vector (Double, V.Vector Double)
> vals = V.map (\(y, x) -> (y, V.fromList [1.0, x])) $ createSample

> main :: IO ()
> main = do
>   let u = V.map fst vals
>       v = V.map snd vals
>       hs = iterate (stepOnce gamma u v) initTheta
>       xs = V.map snd vals
>       theta  = head $ drop nIters hs
>       theta' = head $ drop 100 $ estimates u v initTheta
>   printf "Hand crafted descent: theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta V.! 0) (theta V.! 1)
>   printf "Library descent:      theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta' V.! 0) (theta' V.! 1)
>   let predProbs  = V.map (\x -> logit $ V.sum $ V.zipWith (*) theta x) xs
>       mismatches = V.filter (> 0.5) $
>                    V.map abs $
>                    V.zipWith (-) actuals preds
>         where
>           actuals = V.map fst vals
>           preds   = V.map (\x -> fromIntegral $ fromEnum (x > 0.5))
>                           predProbs
>                           predProbs
>   let lActuals, lMisMatches :: Double
>       lActuals    = fromIntegral $ V.length vals
>       lMisMatches = fromIntegral $ V.length mismatches
>   printf "%5.2f%% correct\n" $
>          100.0 *  (lActuals - lMisMatches) / lActuals


And we get quite reasonable estimates:

ghci> main
Hand crafted descent: theta_0 = -2.071, theta_1 = 4.318
Library descent:      theta_0 = -2.070, theta_1 = 4.319
97.00% correct