Regression and Automated Differentiation

Introduction

Automated differentiation was developed in the 1960s but even now does not seem to be widely used. Even experienced and knowledgeable practitioners often assume it is either a finite-difference method or symbolic computation when it is neither.

This article gives a very simple application of it in a machine learning / statistics context.

Multivariate Linear Regression

We model a dependent variable as a linear function of a set of independent variables, observed in a noisy environment.

\displaystyle   y^{(i)} = \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)} + \epsilon^{(i)}

where

  • i runs from 1 to n, the number of observations;

  • \epsilon^{(i)} are i.i.d. normal with mean 0 and the same variance \sigma^2: \epsilon^{(i)} \sim \mathcal{N} (0,\sigma^2);

  • For each i, \boldsymbol{x}^{(i)} is a column vector of size m; and

  • \boldsymbol{\theta} is a column vector also of size m.

In other words:

\displaystyle   p(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}) =  \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big)

We can therefore write the likelihood function given all the observations as:

\displaystyle   \mathcal{L}(\boldsymbol{\theta}; X, \boldsymbol{y}) =  \prod_{i = 1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big)

In order to find the best fitting parameters \boldsymbol{\theta} we therefore need to maximize this function with respect to \boldsymbol{\theta}. The standard approach is to maximize the log likelihood which, since log is monotonic, will give the same result.

\displaystyle   \begin{aligned}  \lambda(\boldsymbol{\theta}) &= \log \mathcal{L}(\boldsymbol{\theta}) \\                               &= \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big) \\                               &= n\log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2  \end{aligned}

Hence maximizing the likelihood is the same as minimizing the (biased) estimate of the variance:

\displaystyle   \frac{1}{n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2

We can define a cost function:

\displaystyle   \mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2

Clearly minimizing this will give the same result. The constant 1/2 is conventional: it cancels when the derivative is taken. In our case this is irrelevant, as we are not going to derive the derivative explicitly but will use automated differentiation.
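
For reference, differentiating under the summation sign gives the closed form which automated differentiation will compute for us numerically:

\displaystyle   \nabla\mathcal{J}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})\,\boldsymbol{x}^{(i)}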

In order to minimize the cost function, we use the method of steepest descent: if \boldsymbol{\theta}^i is a guess for the parameters of the model then we can improve the guess by stepping a small distance in the direction of greatest decrease, that is, against the gradient.

\displaystyle   \boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} - \gamma \nabla\mathcal{J}(\boldsymbol{\theta}^{i})

\gamma is some constant known in machine learning as the learning rate. It must be chosen to be large enough to ensure convergence within a reasonable number of steps but not so large that the algorithm fails to converge.

When the number of observations is large, evaluating the gradient of the total cost at every step can be expensive; as a cheaper alternative we can use stochastic gradient descent: instead of computing the gradient over all the observations at once, we take a step using the gradient at a single observation, working through the data set in turn. Of course, if our data set is small we may have to pass over it several times to achieve convergence.

Implementation

Some pragmas to warn us about potentially dangerous situations.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults  #-}
> module Linear where

Modules from the automatic differentiation library.

> import Numeric.AD
> import Numeric.AD.Types
> import qualified Data.Vector as V

Some modules from a random number generator library as we will want to generate some test data.

> import Data.Random ()
> import Data.Random.Distribution.Normal
> import Data.Random.Distribution.Uniform
> import Data.RVar

Our model: the predicted value of y is \hat{y} given the observations \boldsymbol{x}.

> yhat :: Floating a =>
>         V.Vector a ->
>         V.Vector a -> a
> yhat x theta = V.sum $ V.zipWith (*) theta x
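
As a quick sanity check, here is a hypothetical ghci session with made-up values: \boldsymbol{x} = (1.0, 2.0) and \boldsymbol{\theta} = (0.5, 0.25) give \hat{y} = 0.5 + 0.5 = 1.0.

ghci> yhat (V.fromList [1.0, 2.0]) (V.fromList [0.5, 0.25])
  1.0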

For each observation, the “cost” of the difference between the actual value of y and its predicted value.

> cost :: Floating a =>
>         V.Vector a ->
>         a ->
>         V.Vector a ->
>         a
> cost theta y x = 0.5 * (y - yhat x theta)^2
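
Continuing the hypothetical session: with y = 2.0 the residual is 1.0, so the cost should be 0.5 \times 1.0^2 = 0.5.

ghci> cost (V.fromList [0.5, 0.25]) 2.0 (V.fromList [1.0, 2.0])
  0.5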

To find its gradient we merely apply the operator grad.

> delCost :: Floating a =>
>            a ->
>            V.Vector a ->
>            V.Vector a ->
>            V.Vector a
> delCost y x = grad $ \theta -> cost theta (auto y) (V.map auto x)
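
For a single observation the gradient should be -(y - \hat{y})\boldsymbol{x}; with the residual of 1.0 from the session above we therefore expect (-1.0, -2.0).

ghci> delCost 2.0 (V.fromList [1.0, 2.0]) (V.fromList [0.5, 0.25])
  fromList [-1.0,-2.0]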

We can use the single observation cost function to define the total cost function.

> totalCost :: Floating a =>
>              V.Vector a ->
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              a
> totalCost theta y x = (/l) $ V.sum $ V.zipWith (cost theta) y x
>   where
>     l = fromIntegral $ V.length y
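
Again a hand-check with made-up data, this time two observations: the first has residual 1.0 (cost 0.5) and the second residual 0.0, so the total cost should be (0.5 + 0.0) / 2 = 0.25.

ghci> let ys = V.fromList [2.0, 1.0]
ghci> let xs = V.fromList [V.fromList [1.0, 2.0], V.fromList [2.0, 0.0]]
ghci> totalCost (V.fromList [0.5, 0.25]) ys xs
  0.25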

Again taking the derivative is straightforward.

> delTotalCost :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalCost y x = grad f
>   where
>     f theta = totalCost theta (V.map auto y) (V.map (V.map auto) x)
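
The per-observation gradients in the running example are (-1.0, -2.0) and (0.0, 0.0), so their average should be (-0.5, -1.0).

ghci> delTotalCost ys xs (V.fromList [0.5, 0.25])
  fromList [-0.5,-1.0]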

Now we can implement steepest descent.

> stepOnce :: Double ->
>             V.Vector Double ->
>             V.Vector (V.Vector Double) ->
>             V.Vector Double ->
>             V.Vector Double
> stepOnce gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>   where
>     del = delTotalCost y x

For stochastic gradient descent we take a step using the gradient at a single observation.

> stepOnceStoch :: Double ->
>                  Double ->
>                  V.Vector Double ->
>                  V.Vector Double ->
>                  V.Vector Double
> stepOnceStoch gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>   where
>     del = delCost y x
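
A single step with \gamma = 0.1, reusing ys and xs from the session above, should move each parameter against the gradient: 0.5 - 0.1 \times (-0.5) = 0.55 and 0.25 - 0.1 \times (-1.0) = 0.35 (the last printed digits may differ by rounding).

ghci> stepOnce 0.1 ys xs (V.fromList [0.5, 0.25])
  fromList [0.55,0.35]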

Let’s try it out. First we need to generate some data.

> createSample :: Double -> V.Vector Double -> IO (Double, V.Vector Double)
> createSample sigma2 theta = do
>   let l = V.length theta
>   x <- V.sequence $ V.replicate (l - 1) $ sampleRVar stdUniform
>   let mu = (theta V.! 0) + yhat x (V.drop 1 theta)
>   y <- sampleRVar $ normal mu (sqrt sigma2) -- 'normal' takes the standard deviation, not the variance
>   return (y, x)

We create a model with two independent variables and thus three parameters.

> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 0.6, 0.7]

We initialise our algorithm with arbitrary values.

> initTheta :: V.Vector Double
> initTheta = V.replicate 3 0.1

We give our model an arbitrary variance.

> sigma2 :: Double
> sigma2 = 0.01

And set the learning rate and the number of iterations.

> nSamples, nIters :: Int
> nSamples = 100
> nIters = 2000
> gamma :: Double
> gamma = 0.1

Now we can run our example. For the constant parameter of our model (known in machine learning as the bias) we ensure that the corresponding “independent variable” is always set to 1.0.

> main :: IO ()
> main = do
>   vals <- V.sequence $
>           V.replicate nSamples $
>           createSample sigma2 actualTheta
>   let y = V.map fst vals
>       x = V.map snd vals
>       x' =  V.map (V.cons 1.0) x
>       hs = iterate (stepOnce gamma y x') initTheta
>       update theta = V.foldl f theta $ V.zip y x'
>         where
>           f theta (y, x) = stepOnceStoch gamma y x theta
>   print $ take 1 $ drop nIters hs
>   let f = foldr (.) id $ replicate nSamples update
>   print $ f initTheta

And we get quite reasonable estimates for the parameters.

ghci> main
  [fromList [-8.34526742975572e-4,0.6024033722648041,0.69483650585735]]
  fromList [7.095387239884274e-5,0.6017197904731632,0.694335002002961]
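
The ad library also provides a gradientDescent routine which packages up this optimisation loop (it comes up in the comments below). Here is a minimal sketch of how it might be used in place of stepOnce; fitGD is a name invented here, and the exact type of gradientDescent varies between versions of the library.

> fitGD :: V.Vector Double ->
>          V.Vector (V.Vector Double) ->
>          V.Vector Double ->
>          [V.Vector Double]
> fitGD y x = gradientDescent f
>   where
>     f theta = totalCost theta (V.map auto y) (V.map (V.map auto) x)

Inside main, something like take 1 $ drop nIters $ fitGD y x' initTheta should then give estimates comparable to those above, though gradientDescent chooses its own step sizes, so its convergence behaviour may differ from the fixed-\gamma loop.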

10 Responses to Regression and Automated Differentiation

  1. Nice example. You might want to try the `gradientDescent` routine, which should take over the optimisation bit. Also, it is a one-liner to goose up the model from something linear into something fancier like a multilayer perceptron: `fmap atanh . linearMap w2 . fmap atanh . linearMap w1 …`

    • Dominic Steinitz says:

      Yes, I was going to do logistic regression next, followed by the multi-layer perceptron. It’s something of a mystery why the Machine Learning community seems to focus on backpropagation when they could just apply automated differentiation and gradient descent. Also, I was aware of the gradientDescent routine, so I should use it for the logistic regression example :-)

  2. You may also be able to use one of the other gradientDescent variants, e.g. conjugateGradientDescent, or the Halley version depending on your problem.

    We’ve begun actively exploring adding more gradient descent techniques (BFGS, etc.) to the library of late.

  3. I’ve always viewed AD as symbolic differentiation where the compiler does the symbolic manipulation: is there something wrong with that view?

  4. Ganesh,

    Yes.

    With automatic differentiation we don’t push symbols even in the compiler, we use the partial derivatives of each primitive operation and either collapse them as we go (in forward mode) or build a tree with just the partial derivatives with regards to the particular arguments it was given (in reverse mode).

    Note we only store the partial derivatives in that tree with respect to the actual inputs given, we don’t store a tree node with the actual operation.

    There is no need to store a node that says this thing is ‘Sin’ or ‘*’.

    This makes a big difference compared to symbolic differentiation. In the absence of a ‘let’ in the grammar for your symbolically differentiated form, that difference is actually asymptotic in the time it takes to evaluate. Symbolic solutions without sharing are based on the expanded tree, and AD solutions using forward mode can get sharing on the values with no extra effort.

    The nice thing about using “automatic” differentiation as opposed to a purer symbolic approach is that as you go you have the ‘primal’ answer, so if you need to use conditional branching based on that primal answer, etc it still just works.

    You can derive symbolic differentiation from automatic differentiation, at least for a given code path, by just using a traced numeric type and lifting it into AD; in reverse mode you have to use observable sharing or other nastiness to cheat, which is of course open to symbolic approaches as well. So in practice this distinction may seem a bit pedantic, but it still makes a huge difference in implementation strategy and use.

  5. There is no need to store a node that says this thing is ‘Sin’ or ‘*’.

    Doesn’t your source program (manipulated by the compiler) contain those nodes?

  6. In the definition of ‘cos’ for each mode we wind up with something like:

    cos = lift1 cos (negate . sin)

    where lift1 is a function that defines how to compute the answer as normal and its partial derivative with respect to its input.

    if we were using a simple one-step forward mode like

    data Dual a = Dual a a

    then

    lift1 f df (Dual b db) = Dual (f b) (df b * db)

    In a truly symbolic story, I’d be building up some kind of tree

    data Symbolic a = Val a | Cos (Symbolic a) | Symbolic a :*: Symbolic a | …

    cos = Cos

    Here, the source program still exists, yes, so as I said, the distinction is a bit pedantic, but I also get the full power of the source language.

    I view automatic differentiation as working on the quotient of the information exposed to an arbitrary symbolic differentiator that is actually interesting to the problem of differentiation — that tree of derivatives, not the operations that got me there.

    There is a relationship between them that is even nodded to in the Wikipedia article for AD, though, and with the addition of symbolic numeric types the line becomes even more blurry.

  7. I guess it’s just a question of emphasis: I think that AD is a special case of symbolic differentiation where the machinery of the differentiation ends up being managed by the compiler; the AST the compiler manipulates does have the kind of representation embodied by Symbolic.

  8. I mostly find it a choice of where the operation happens.

    One can view symbolic as working on the initial representation as a syntax tree and automatic as working on the final representation in terms of composing primitive operations with known Jacobians, where now the only choice remaining is which order to do the composition.

    Even that description in the Wikipedia article acknowledges that if you pull back far enough to the originating mathematics, you get something morally symbolic out where you initially produce the code; but by pushing it down into the code, we just calculate the derivative for the value you are using while we calculate f(x), rather than plugging it into an explicit formula for f'(x).

  9. Pingback: Logistic Regression and Automated Differentiation | Idontgetoutmuch's Weblog
