# Regression and Automated Differentiation

## Introduction

Automated differentiation was developed in the 1960s but even now does not seem to be widely used. Even experienced and knowledgeable practitioners often assume it is either a finite difference method or symbolic computation, when it is neither.

This article gives a very simple application of it in a machine learning / statistics context.
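To see why it is neither, here is a minimal forward-mode sketch using dual numbers (a toy illustration only, not the interface of the ad library used later). Each number carries its derivative alongside its value, so the derivative falls out exactly: there is no finite-difference step size to choose, and no symbolic expression tree is ever built.

```haskell
-- A dual number: a value paired with its derivative.
data Dual = Dual Double Double deriving Show

instance Num Dual where
  Dual a a' + Dual b b' = Dual (a + b) (a' + b')
  Dual a a' * Dual b b' = Dual (a * b) (a' * b + a * b')  -- product rule
  negate (Dual a a')    = Dual (negate a) (negate a')
  abs    (Dual a a')    = Dual (abs a) (signum a * a')
  signum (Dual a _)     = Dual (signum a) 0
  fromInteger n         = Dual (fromInteger n) 0          -- constants have derivative 0

-- Differentiate f at x by seeding the derivative component with 1.
diffD :: (Dual -> Dual) -> Double -> Double
diffD f x = let Dual _ x' = f (Dual x 1) in x'

-- d/dx (x^2 + 3x + 1) at x = 2 is 2*2 + 3 = 7, computed exactly.
example :: Double
example = diffD (\x -> x * x + 3 * x + 1) 2
```

Reverse mode, which the grad operator used below employs, propagates the same primitive partial derivatives in the opposite direction; that is cheaper when there are many inputs and a single output, as in the cost functions of this article.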

## Multivariate Linear Regression

We model a dependent variable as depending linearly on a set of independent variables, observed with noise.

$\displaystyle y^{(i)} = \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)} + \epsilon^{(i)}$

where

• $i$ runs from 1 to $n$, the number of observations;

• $\epsilon^{(i)}$ are i.i.d. normal with mean $0$ and the same variance $\sigma^2$: $\epsilon^{(i)} \sim \mathcal{N} (0,\sigma^2)$;

• For each $i$, $\boldsymbol{x^{(i)}}$ is a column vector of size $m$ and

• $\boldsymbol{\theta}$ is a column vector also of size $m$.

In other words:

$\displaystyle p(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big)$

We can therefore write the likelihood function given all the observations as:

$\displaystyle \mathcal{L}(\boldsymbol{\theta}; X, \boldsymbol{y}) = \prod_{i = 1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big)$

In order to find the best fitting parameters $\boldsymbol{\theta}$ we therefore need to maximize this function with respect to $\boldsymbol{\theta}$. The standard approach is to maximize the log likelihood which, since log is monotonic, will give the same result.

$\displaystyle \begin{aligned} \lambda(\boldsymbol{\theta}) &= \log \mathcal{L}(\boldsymbol{\theta}) \\ &= \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi}\sigma}\exp\big(\frac{-(y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2}{2\sigma^2}\big) \\ &= n\log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2 \end{aligned}$

Hence maximizing the likelihood is the same as minimizing the (biased) estimate of the variance:

$\displaystyle \frac{1}{n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2$

We can define a cost function:

$\displaystyle \mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2n}\sum_{i=1}^n (y^{(i)} - \boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)})^2$

Clearly minimizing this will give the same result. The constant $1/2$ is to make the manipulation of the derivative easier. In our case, this is irrelevant as we are not going to derive the derivative explicitly but use automated differentiation.

In order to minimize the cost function, we use the method of steepest descent: if $\boldsymbol{\theta}^i$ is a guess for the parameters of the model then we can improve the guess by stepping a small distance in the direction opposite to the gradient.

$\displaystyle \boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} - \gamma \nabla\mathcal{J}(\boldsymbol{\theta}^{i})$

$\gamma$ is some constant known in machine learning as the learning rate. It must be chosen to be large enough to ensure convergence within a reasonable number of steps but not so large that the algorithm fails to converge.

When the number of observations is large, evaluating the gradient of the full cost function at every step can be expensive; as a cheaper alternative we can use stochastic gradient descent. Instead of taking the gradient with respect to all the observations at once, we step using the gradient of the cost of each individual observation in turn. Of course, if our data set is small we may have to pass over it several times to achieve convergence.
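As a toy illustration of the difference between the two update schemes, here is a hypothetical one-parameter model $\hat{y} = \theta x$ with its gradient written out by hand, purely for contrast (the implementation below obtains its gradients automatically):

```haskell
-- One batch step: gradient of J(theta) = (1/2n) * sum (y - theta*x)^2
-- taken over all observations at once.
batchStep :: Double -> [(Double, Double)] -> Double -> Double
batchStep gamma obs theta = theta - gamma * gradJ
  where
    n     = fromIntegral (length obs)
    gradJ = negate (sum [ (y - theta * x) * x | (x, y) <- obs ]) / n

-- One stochastic sweep: a small step per observation, folded over the data.
stochSweep :: Double -> [(Double, Double)] -> Double -> Double
stochSweep gamma obs theta0 =
  foldl (\theta (x, y) -> theta + gamma * (y - theta * x) * x) theta0 obs
```

With data drawn from $y = 2x$, iterating either function from $\theta = 0$ converges to $\theta \approx 2$; the stochastic version touches one observation at a time and so never evaluates the full sum.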

## Implementation

Some pragmas to warn us about potentially dangerous situations.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults  #-}
> module Linear where

Modules from the automatic differentiation library and from a vector library.

> import Numeric.AD
> import qualified Data.Vector as V

Some modules from a random number generator library as we will want to generate some test data.

> import Data.Random ()
> import Data.Random.Distribution.Normal
> import Data.Random.Distribution.Uniform
> import Data.RVar

Our model: the predicted value of $y$ is $\hat{y}$ given the observations $\boldsymbol{x}$.

> yhat :: Floating a =>
>         V.Vector a ->
>         V.Vector a -> a
> yhat x theta = V.sum $ V.zipWith (*) theta x

For each observation, the “cost” of the difference between the actual value of $y$ and its predicted value.

> cost :: Floating a =>
>         V.Vector a ->
>         a ->
>         V.Vector a
>         -> a
> cost theta y x = 0.5 * (y - yhat x theta)^2

To find its gradient we merely apply the operator grad.

> delCost :: Floating a =>
>            a ->
>            V.Vector a ->
>            V.Vector a ->
>            V.Vector a
> delCost y x = grad $ \theta -> cost theta (auto y) (V.map auto x)

We can use the single observation cost function to define the total cost function.

> totalCost :: Floating a =>
>              V.Vector a ->
>              V.Vector a ->
>              V.Vector (V.Vector a)
>              -> a
> totalCost theta y x = (/l) $ V.sum $ V.zipWith (cost theta) y x
>   where
>     l = fromIntegral $ V.length y

Again taking the derivative is straightforward.

> delTotalCost :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalCost y x = grad f
>   where
>     f theta = totalCost theta (V.map auto y) (V.map (V.map auto) x)

Now we can implement steepest descent.

> stepOnce :: Double ->
>             V.Vector Double ->
>             V.Vector (V.Vector Double) ->
>             V.Vector Double ->
>             V.Vector Double
> stepOnce gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>   where
>     del = delTotalCost y x
And the stochastic version, which takes a step against the gradient of the cost of a single observation.

> stepOnceStoch :: Double ->
>                  Double ->
>                  V.Vector Double ->
>                  V.Vector Double ->
>                  V.Vector Double
> stepOnceStoch gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>   where
>     del = delCost y x

Let’s try it out. First we need to generate some data.

> createSample :: Double -> V.Vector Double -> IO (Double, V.Vector Double)
> createSample sigma2 theta = do
>   let l = V.length theta
>   x <- V.sequence $ V.replicate (l - 1) $ sampleRVar stdUniform
>   let mu = (theta V.! 0) + yhat x (V.drop 1 theta)
>   y <- sampleRVar $ normal mu sigma2
>   return (y, x)

We create a model with two independent variables and thus three parameters.

> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 0.6, 0.7]

We initialise our algorithm with arbitrary values.

> initTheta :: V.Vector Double
> initTheta = V.replicate 3 0.1

We give our model an arbitrary variance.

> sigma2 :: Double
> sigma2 = 0.01

And set the learning rate and the number of iterations.

> nSamples, nIters :: Int
> nSamples = 100
> nIters = 2000
> gamma :: Double
> gamma = 0.1

Now we can run our example. For the constant parameter of our model (known in machine learning as the bias) we ensure that the corresponding “independent variable” is always set to $1.0$.

> main :: IO ()
> main = do
>   vals <- V.sequence $
>           V.replicate nSamples $
>           createSample sigma2 actualTheta
>   let y = V.map fst vals
>       x = V.map snd vals
>       x' =  V.map (V.cons 1.0) x
>       hs = iterate (stepOnce gamma y x') initTheta
>       update theta = V.foldl
>         (\t (y'', x'') -> stepOnceStoch gamma y'' x'' t) theta (V.zip y x')
>   putStrLn $ show $ take 1 $ drop nIters hs
>   let g = foldr (.) id $ replicate nSamples update
>   putStrLn $ show $ g initTheta

And we get quite reasonable estimates for the parameters.

ghci> main
[fromList [-8.34526742975572e-4,0.6024033722648041,0.69483650585735]]
fromList [7.095387239884274e-5,0.6017197904731632,0.694335002002961]

## 15 thoughts on “Regression and Automated Differentiation”

1. Nice example. You might want to try the gradientDescent routine, which should take over the optimisation bit. Also, it is a one-liner to goose up the model from something linear into something fancier like a multilayer perceptron: fmap atanh . linearMap w2 . fmap atanh . linearMap w1 …

• Dominic Steinitz

Yes, I was going to do logistic regression next, followed by the multi-layer perceptron. It’s something of a mystery why the Machine Learning community seems to focus on backpropagation when they could just apply automated differentiation and gradient descent. Also, I was aware of the gradientDescent routine, so I should use it for the logistic regression example.

2. You may also be able to use one of the other gradientDescent variants, e.g. conjugateGradientDescent, or the Halley version depending on your problem.

We’ve begun actively exploring adding more gradient descent techniques (BFGS, etc.) to the library of late.

3. I’ve always viewed AD as symbolic differentiation where the compiler does the symbolic manipulation: is there something wrong with that view?

4. Ganesh,

Yes.

With automatic differentiation we don’t push symbols even in the compiler, we use the partial derivatives of each primitive operation and either collapse them as we go (in forward mode) or build a tree with just the partial derivatives with regards to the particular arguments it was given (in reverse mode).

Note we only store the partial derivatives in that tree with respect to the actual inputs given, we don’t store a tree node with the actual operation.

There is no need to store a node that says this thing is ‘Sin’ or ‘*’.

This makes a big difference compared to symbolic differentiation. In the absence of a ‘let’ in the grammar for your symbolically differentiated form, that difference is actually asymptotic in the time it takes to evaluate. Symbolic solutions without sharing are based on the expanded tree, and AD solutions using forward mode can get sharing on the values with no extra effort.

The nice thing about using “automatic” differentiation as opposed to a purer symbolic approach is that as you go you have the ‘primal’ answer, so if you need to use conditional branching based on that primal answer, etc it still just works.

You can derive symbolic differentiation from automatic differentiation, at least for a given code path, by just using a traced numeric type and lifting it into AD. In reverse mode you have to use observable sharing or other nastiness to cheat, which is of course open to symbolic approaches as well, so in practice this distinction may seem a bit pedantic, but it still makes a huge difference in implementation strategy and use.

5. There is no need to store a node that says this thing is ‘Sin’ or ‘*’.

Doesn’t your source program (manipulated by the compiler) contain those nodes?

6. In the definition for ‘cos’ for each mode we wind up with something like:

cos = lift1 cos (negate . sin)

where lift1 is a function that defines how to compute the answer as normal and its partial derivative with respect to its input.

if we were using a mode like a simple 1-step forward mode like

data Dual a = Dual a a

then

lift1 f df (Dual b db) = Dual (f b) (df b * db)
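Filling in the remaining pieces, the fragments above assemble into a small runnable module (the Fractional and Floating instances are my additions, using the standard derivative rules):

```haskell
data Dual a = Dual a a deriving Show

-- Compute the answer as normal together with its partial derivative.
lift1 :: Num a => (a -> a) -> (a -> a) -> Dual a -> Dual a
lift1 f df (Dual b db) = Dual (f b) (df b * db)

instance Num a => Num (Dual a) where
  Dual a da + Dual b db = Dual (a + b) (da + db)
  Dual a da * Dual b db = Dual (a * b) (da * b + a * db)
  negate (Dual a da)    = Dual (negate a) (negate da)
  abs                   = lift1 abs signum
  signum                = lift1 signum (const 0)
  fromInteger n         = Dual (fromInteger n) 0

instance Fractional a => Fractional (Dual a) where
  recip          = lift1 recip (\b -> negate (recip (b * b)))
  fromRational r = Dual (fromRational r) 0

instance Floating a => Floating (Dual a) where
  pi    = Dual pi 0
  exp   = lift1 exp exp
  log   = lift1 log recip
  sqrt  = lift1 sqrt (\b -> recip (2 * sqrt b))
  sin   = lift1 sin cos
  cos   = lift1 cos (negate . sin)
  atan  = lift1 atan (\b -> recip (1 + b * b))
  asin  = lift1 asin (\b -> recip (sqrt (1 - b * b)))
  acos  = lift1 acos (\b -> negate (recip (sqrt (1 - b * b))))
  sinh  = lift1 sinh cosh
  cosh  = lift1 cosh sinh
  asinh = lift1 asinh (\b -> recip (sqrt (b * b + 1)))
  acosh = lift1 acosh (\b -> recip (sqrt (b * b - 1)))
  atanh = lift1 atanh (\b -> recip (1 - b * b))

-- Differentiate by seeding the infinitesimal with 1. Note that nothing
-- here ever stores a node saying "this was Sin": only the partials survive.
diffD :: Num a => (Dual a -> Dual a) -> a -> a
diffD f x = let Dual _ dx = f (Dual x 1) in dx
```

For example, `diffD sin 0` yields `cos 0`, with no expression tree built along the way.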

In a truly symbolic story, I’d be building up some kind of tree

data Symbolic a = Val a | Cos (Symbolic a) | Symbolic a :*: Symbolic a | …

cos = Cos

Here, the source program still exists, yes, so as i said, the distinction is a bit pedantic, but I also get the full power of the source language.

I view automatic differentiation as working on the quotient of the information exposed to an arbitrary symbolic differentiator that is actually interesting to the problem of differentiation — that tree of derivatives, not the operations that got me there.

There is a relationship between them that is even nodded to in the wikipedia article for AD, though, and with the addition of symbolic numeric types the line becomes even more blurry.

7. I guess it’s just a question of emphasis: I think that AD is a special case of symbolic differentiation where the machinery of the differentiation ends up being managed by the compiler; the AST the compiler manipulates does have the kind of representation embodied by Symbolic.

8. I mostly find it a choice of where the operation happens.

One can view symbolic as working on the initial representation as a syntax tree and automatic as working on the final representation in terms of composing primitive operations with known Jacobians, where now the only choice remaining is which order to do the composition.

Even that description in the wikipedia article acknowledges that if you pull far enough back to the original mathematics, you get something morally symbolic out where you initially produce the code, but by pushing it down into the code, we just calculate the derivative for the value you are using while we calculate f(x), rather than plugging it into an explicit formula for f'(x).

9. mekwasabi

You juggle effortlessly two ideas close to my researcher’s heart: machine learning and functional programming (though I’ve never implemented ML in FP, but I reckon there’s plenty of opportunity here).
So if I understand correctly, your AD library requires (invertible, terminating) functions with known Jacobians as, respectively, the first and second argument to lift1 (what is the operator that maps a manifold to its tangent bundle called?).
Moreover, where can I find some MWEs of your AD library? I’ve cabal installed it, for the time being.

10. Dominic Steinitz

I am not sure to whom you are addressing your comment. The AD library was written by (I think) Edward Kmett and Barak Pearlmutter. I just used it in my blog article on Extended Kalman Filters (https://idontgetoutmuch.wordpress.com/2014/09/09/fun-with-extended-kalman-filters-4/). I have probably used it in other places. I don’t think the functions need to be invertible just differentiable. Hope this helps.

11. Way back in the day Pearlmutter and Siskind had written a ton of beautiful papers on “Lambda the Ultimate Backpropagator”, etc.

Somewhere along the way Conal Elliott wrote a nice article on Beautiful Differentiation describing forward mode, packaging up the sort of folklore of the time on AD.

Barak Pearlmutter wrote “fad” with Bjorn Buckwalter and Jeffrey Mark Siskind, which worked on a lazy forward mode derivative tower. The difference between Barak et al. and Conal’s encoding is that Björn had borrowed a trick from an article by Chung-Chieh Shan on his blog to avoid perturbation/sensitivity confusion problems.

Pearlmutter and Siskind had used a fresh tag supply to get this safety in stalingrad, their compiler for a scheme-like with built-in AD. Björn’s adaptation let them approximate this in Haskell at least in a setting where the combinators are properly nested. Some things like ‘flip grad’ ceased to be well-typed, so some flexibility was lost, and rank-n types can be more painful to work with, but correctness was preserved.

I built a reverse mode AD library “rad” based loosely on its API, and stole Björn’s trick of using quantifiers to “brand” your infinitesimals in response to a stack overflow question. To do this I used Andy Gill’s work on “observable sharing” from Kansas Lava, so my reverse mode was rather non-standard, based on a topological sort of the graph rather than a linear tape.

I then went off and built a framework for doing reverse mode via tapes (Wengert lists), various sparse and dense variants on forward mode, etc. hiding the choice of mode behind the quantifier Chung-Chieh introduced.

Anthony Cowley helped me scale the “ad” package up to hundreds of thousands of variables to better support his work on computer vision and motion planning for robots.

However, the decision of how to “hide the mode” ultimately turned out to limit the end user’s ability to define new primitive numeric operations, so we redesigned the library, and ad 4 was released after a considerable overhaul by me and Alex Lang.

12. Invertibility is definitely not required. Termination is only required if you look at it.😉

13. mekwasabi

thank you Dominic and Edward for the clarifications and inspiring work.

Unfortunately I cannot get these examples to work due to a build error in one of random-fu’s dependencies (I already filed a bug report), but I will soon try the AD library.