Gaussian Error Linear Unit (GELU)

Paper (arXiv): Gaussian Error Linear Unit (GELU)
Definition of GELU

${\rm GELU}(x) = xP(X \leq x) = x\Phi(x) = x\cdot \frac{1}{2} \left[ 1+ {\rm erf}(x/\sqrt2) \right]$

where $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, $X \sim \mathcal{N}(0,1)$, and ${\rm erf}(\cdot)$ is the error function.
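
As a quick numerical illustration of the definition, here is a minimal Python sketch that evaluates GELU through the standard library's `math.erf`. The names `phi_cdf` and `gelu` are illustrative choices, not taken from the paper.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * phi_cdf(x)

# GELU is close to 0 for large negative inputs and close to x for large positive ones.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"gelu({x:+.1f}) = {gelu(x):+.6f}")
```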
Derivative of GELU

$\begin{split} \frac{d}{dx} {\rm GELU}(x) &= \frac{d}{dx}\left[x\cdot\Phi(x)\right]\\ &= \Phi(x) + x \cdot \frac{d}{dx}\Phi(x)\\ &= \Phi(x) + x \cdot \varphi(x) \end{split}$

where $\varphi(x)$ is the probability density function of the standard normal distribution.

Or, starting from the erf form:

$\begin{split} \frac{d}{dx} {\rm GELU}(x) &= \frac{d}{dx}\left[\frac{1}{2}x + \frac{1}{2}x\cdot{\rm erf}(x/\sqrt2)\right]\\ &= \frac{1}{2}\left[1 + {\rm erf}(x/\sqrt2) + x \cdot \sqrt\frac{2}{\pi} e^{-x^2/2} \right]\\ &= \frac{1}{2}\left[1 + {\rm erf}(x/\sqrt2) \right] + x \cdot \sqrt\frac{1}{2\pi} e^{-x^2/2} \\ &= \Phi(x) + x \cdot \varphi(x) \end{split}$
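
The closed form $\Phi(x) + x\varphi(x)$ can be sanity-checked against a central finite difference. The following is only a sketch; the helper names (`gelu_grad`, `central_diff`) and the step size `h` are assumptions made for illustration.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_pdf(x: float) -> float:
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x: float) -> float:
    return x * phi_cdf(x)

def gelu_grad(x: float) -> float:
    """Analytic derivative: Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  analytic={gelu_grad(x):.8f}  numeric={central_diff(gelu, x):.8f}")
```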
Tanh Approximation
Using the tanh approximation of ${\rm erf}$:

$\begin{split} {\rm erf}(x/\sqrt2) &\approx \tanh\left[\frac{2}{\sqrt\pi}\left(\frac{x}{\sqrt2}+\frac{11}{123}(\frac{x}{\sqrt2})^3\right)\right]\\ &= \tanh\left[\sqrt\frac{2}{\pi}\left(x + \frac{11}{123}\frac{x^3}{2}\right)\right]\\ &\approx \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right] \end{split}$

We can then obtain the following approximation:

${\rm GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)$
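
To get a feel for how close the tanh form is, the sketch below compares it with the exact erf-based GELU on a grid. The grid range $[-6, 6]$ and the function names are arbitrary choices for this illustration.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation with the 0.044715 cubic coefficient."""
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan a grid and report the largest absolute gap between the two forms.
xs = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh approx| on [-6, 6]: {max_err:.2e}")
```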
Derivative of Tanh Approximation

$\begin{split} &\frac{d}{dx}\left[0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)\right]\\ =& 0.5 + 0.5\tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\\ & + 0.5x\left(\sqrt{2/\pi}\left(1 + 3\times0.044715x^2\right)\right)\left(1-\tanh^2\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)\\ =& 0.5 + 0.5\tanh[\cdots] + \sqrt{\frac{1}{2\pi}}\left(x + 3\times0.044715x^3\right)(1-\tanh^2[\cdots]) \end{split}$
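
The last line of this derivative translates directly into code. In this sketch the constants `C` and `K` are just shorthand for $\sqrt{2/\pi}$ and $0.044715$, and the result is checked against a finite difference of the approximation itself.

```python
import math

C = math.sqrt(2.0 / math.pi)  # sqrt(2/pi)
K = 0.044715                  # cubic coefficient

def gelu_tanh(x: float) -> float:
    return 0.5 * x * (1.0 + math.tanh(C * (x + K * x ** 3)))

def gelu_tanh_grad(x: float) -> float:
    """0.5 + 0.5*tanh[...] + sqrt(1/(2*pi)) * (x + 3*K*x^3) * (1 - tanh^2[...])."""
    t = math.tanh(C * (x + K * x ** 3))
    # 0.5 * C equals sqrt(1/(2*pi)), the factor in the final line of the derivation.
    return 0.5 + 0.5 * t + 0.5 * C * (x + 3.0 * K * x ** 3) * (1.0 - t * t)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  analytic={gelu_tanh_grad(x):.8f}  numeric={central_diff(gelu_tanh, x):.8f}")
```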
Logistic Approximation

${\rm GELU}(x) = x\Phi(x) \approx x\sigma(1.702x) = \frac{x}{1+e^{-1.702x}}$
Derivative of Logistic Approximation

$\frac{d}{dx}\left[\frac{x}{1+e^{-1.702x}}\right] = \frac{1}{1+e^{-1.702x}}+\frac{1.702xe^{-1.702x}}{(1+e^{-1.702x})^2}$
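
A minimal sketch of the logistic approximation and its derivative follows. The function names are illustrative, and `math.exp` can overflow for very large negative inputs, so this is only meant for moderate $x$.

```python
import math

def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_logistic(x: float) -> float:
    """x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

def gelu_logistic_grad(x: float) -> float:
    """Derivative of x / (1 + exp(-1.702 x)), term by term as in the formula above."""
    e = math.exp(-1.702 * x)
    return 1.0 / (1.0 + e) + 1.702 * x * e / (1.0 + e) ** 2

def central_diff(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(gelu_exact(x) - gelu_logistic(x)) for x in xs)
print(f"max |exact - logistic approx| on [-6, 6]: {max_err:.4f}")
print(f"grad at 1.0: analytic={gelu_logistic_grad(1.0):.8f}  numeric={central_diff(gelu_logistic, 1.0):.8f}")
```

On this grid the logistic form is visibly coarser than the tanh form, which is the usual trade-off for its cheaper evaluation.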

Error function

Wikipedia: Error Function
Definition of erf()
The error function ${\rm erf}(x)$ gives the probability that a random variable, normally distributed with mean 0 and variance 1/2, falls in the range $[-x,x]$ . That is:

${\rm erf}(x) = \int_{-x}^x \left. \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{t-\mu}{\sigma})^2}dt\right|^{\mu=0}_{\sigma=1/\sqrt2} = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt$
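
The probabilistic reading can be checked numerically: integrate the density of a normal distribution with mean 0 and variance 1/2 (so $\sigma = 1/\sqrt2$) over $[-x, x]$ and compare with `math.erf`. The sketch below uses a hand-written composite Simpson rule; `simpson` and `prob_in_range` are hypothetical helper names.

```python
import math

def normal_pdf(t: float, mu: float, sigma: float) -> float:
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(f, a: float, b: float, n: int = 1000) -> float:
    """Composite Simpson's rule with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0

SIGMA = 1.0 / math.sqrt(2.0)  # variance 1/2

def prob_in_range(x: float) -> float:
    """P(-x <= T <= x) for T ~ N(0, 1/2), by numerical integration."""
    return simpson(lambda t: normal_pdf(t, 0.0, SIGMA), -x, x)

for x in (0.5, 1.0, 2.0):
    print(f"x={x}: integral={prob_in_range(x):.10f}  math.erf={math.erf(x):.10f}")
```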
Relationship between $erf$ and $\Phi$
$\Phi$ denotes the cumulative distribution function of the standard normal distribution.

$\begin{split} \Phi(x) &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{\frac{-t^2}{2}}dt\\ \tiny{(s=\frac{t}{\sqrt2})} &= \frac{1}{\sqrt{\pi}}\int_{-\infty}^{x/\sqrt2} e^{-s^2}ds\\ &= \frac{1}{\sqrt{\pi}}\int_{-\infty}^0 e^{-s^2}ds + \frac{1}{\sqrt{\pi}}\int_{0}^{x/\sqrt2} e^{-s^2}ds\\ &= \frac{1}{2} + \frac{1}{2}erf(\frac{x}{\sqrt2}) \end{split}$
Derivative

$\frac{d}{dx}erf(x) = \frac{d}{dx}\left(\frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt\right) = \frac{2}{\sqrt\pi} e^{-x^2}$
Numerical Approximation
Source: Wikipedia, Error function, section "Approximation with elementary functions"

$erf(x) \approx \tanh\left[\frac{2}{\sqrt\pi}\left(x+\frac{11}{123}x^3\right)\right]$

How to find the tanh approximation for erf
The tanh approximation of the error function can be derived by using Taylor expansion of these two functions.

$\begin{split} \tanh(x) \xrightarrow[\text{expansion}]{\text{taylor}} &= \sum_{n=1}^\infty \frac{2^{2n}(2^{2n}-1)B_{2n}x^{2n-1}}{(2n)!},\quad |x|<\frac{\pi}{2}\\ &= x - \frac{x^3}{3} + \frac{2x^5}{15} - \frac{17x^7}{315} + \cdots\\ {\rm erf}(x) \xrightarrow[\text{expansion}]{\text{taylor}} &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n+1}}{n!(2n+1)}\\ &= \frac{2}{\sqrt\pi}\left(x-\frac{x^3}{3}+\frac{x^5}{10}-\frac{x^7}{42}+\frac{x^9}{216}-\cdots\right) \end{split}$

Let’s say

$z(x) = \tanh(a_0 + a_1x + a_2x^2 + a_3x^3) \approx {\rm erf}(x)$

First, erf is an odd function, so $z(x)$ should also be odd in order to approximate it. Since tanh is itself odd, its argument must be an odd polynomial, so $a_0=a_2=0$ .

$z(x) = \tanh(a_1x + a_3x^3)$

Use Taylor expansion to expand $z(x)$

$\tanh(a_1x + a_3x^3) = \fbox{$ \begin{array}{} a_1x &+& a_3x^3\\ &-& \frac{1}{3}a_1^3x^3 &-& a_1^2a_3x^5 &-& a_1a_3^2x^7 &-& \frac{1}{3}a_3^3x^9\\ & & &+& \frac{2}{15}a_1^5x^5 &+& \cdots\\ & & & & & - & \cdots \end{array}$}$

Compare the coefficients of the two expansions

$\begin{cases} a_1 = \frac{2}{\sqrt\pi}\\ a_3 - \frac{1}{3}a_1^3 = -\frac{2}{3\sqrt\pi} \end{cases} \Rightarrow \begin{cases} a_1 = \frac{2}{\sqrt\pi}\\ a_3 = \frac{2}{\sqrt\pi}\frac{(4-\pi)}{3\pi} \end{cases}$

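A closely related formula, which uses $\frac{\pi}{35} \approx 0.0898$ for the cubic coefficient in place of $\frac{4-\pi}{3\pi} \approx 0.0911$, is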

$z(x) = \tanh\left[\frac{2}{\sqrt\pi}\left(x+\frac{\pi}{35}x^3\right)\right]$

which keeps the relative difference $\left\vert 1-\frac{{\rm erf}(x)}{z(x)}\right\vert < 0.0005$ .

However, we are not done yet. You may have noticed that the wiki gives the cubic coefficient $a_3$ as $\frac{2}{\sqrt\pi}\frac{11}{123}$ rather than the $\frac{2}{\sqrt\pi}\frac{(4-\pi)}{3\pi}$ we calculated. The reason is explained here:
How to minimize the maximum absolute difference between 2 functions?
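
The difference between the two cubic coefficients can be seen numerically. The sketch below compares the Taylor-matched value $\frac{4-\pi}{3\pi}$ with the wiki value $\frac{11}{123}$ by scanning a grid. The grid $[-4, 4]$ and the helper names are assumptions for illustration, and a grid scan only estimates the true maximum error.

```python
import math

A1 = 2.0 / math.sqrt(math.pi)

def erf_tanh(x: float, c: float) -> float:
    """tanh[(2/sqrt(pi)) * (x + c * x^3)] for a given cubic coefficient c."""
    return math.tanh(A1 * (x + c * x ** 3))

def max_abs_error(c: float, lo: float = -4.0, hi: float = 4.0, steps: int = 8000) -> float:
    """Largest |erf(x) - erf_tanh(x, c)| over an evenly spaced grid."""
    xs = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(abs(math.erf(x) - erf_tanh(x, c)) for x in xs)

c_taylor = (4.0 - math.pi) / (3.0 * math.pi)  # from matching Taylor coefficients
c_wiki = 11.0 / 123.0                         # value quoted from the wiki

print(f"c_taylor = {c_taylor:.5f}, max error = {max_abs_error(c_taylor):.2e}")
print(f"c_wiki   = {c_wiki:.5f}, max error = {max_abs_error(c_wiki):.2e}")
```

The Taylor-matched coefficient is the more accurate one very close to $x = 0$, while the wiki coefficient gives up a little accuracy there in exchange for a smaller worst-case error over the whole range.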

Derivation for erf’s Taylor series

$\frac{d}{dx}erf(x) = \frac{2}{\sqrt\pi} e^{-x^2} \xrightarrow[\text{expansion}]{\text{taylor}} = \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n}}{n!}$

$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!},\ \text{for } x \in \mathbb{R}$

Take the antiderivative of both sides with respect to x:

$\begin{split} erf(x) &= \frac{2}{\sqrt\pi}\int\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n}}{n!}\,dx\\ \text{\small uniform convergence} &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\int\frac{(-1)^n x^{2n}}{n!}\,dx\\ &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\left[\frac{(-1)^n x^{2n+1}}{n!(2n+1)}+c_n\right]\\ &= \frac{2}{\sqrt\pi}\left(x-\frac{x^3}{3}+\frac{x^5}{10}-\frac{x^7}{42}+\frac{x^9}{216}-\cdots\right) + \frac{2}{\sqrt\pi}\sum_{n=0}^\infty c_n \end{split}$

Setting $x=0$ and using ${\rm erf}(0)=0$, we conclude that the constant term $\sum_{n=0}^\infty c_n$ must be 0. This gives the Taylor expansion of the error function:

$\begin{split} erf(x) \xrightarrow[\text{expansion}]{\text{taylor}} &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n+1}}{n!(2n+1)}\\ &= \frac{2}{\sqrt\pi}\left(x-\frac{x^3}{3}+\frac{x^5}{10}-\frac{x^7}{42}+\frac{x^9}{216}-\cdots\right) \end{split}$
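
The series can be evaluated directly as a partial sum and compared with `math.erf`. This is a sketch only: `erf_taylor` and its `terms` parameter are illustrative names, and the alternating series becomes numerically poor for large $|x|$.

```python
import math

def erf_taylor(x: float, terms: int = 30) -> float:
    """Partial sum of (2/sqrt(pi)) * sum_n (-1)^n x^(2n+1) / (n! (2n+1))."""
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * x ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
    return 2.0 / math.sqrt(math.pi) * total

for x in (0.5, 1.0, 2.0):
    print(f"x={x}: 5 terms={erf_taylor(x, terms=5):.8f}  "
          f"30 terms={erf_taylor(x):.8f}  math.erf={math.erf(x):.8f}")
```

With `terms=5` the sum is exactly the truncated polynomial written out above, which is accurate near zero and drifts away as $|x|$ grows.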

Logistic distribution

Wikipedia: Logistic distribution
Probability density function

$f(x;\mu,s) = \frac{e^{-\frac{x-\mu}{s}}}{s(1+e^{-\frac{x-\mu}{s}})^2}$
Cumulative distribution function

$F(x;\mu,s) = \frac{1}{1+e^{-\frac{x-\mu}{s}}}$
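
The two formulas are related by differentiation, $f = dF/dx$, which the following sketch checks with a finite difference. The parameter values $\mu = 1$, $s = 2$ are arbitrary test values, not from the text.

```python
import math

def logistic_cdf(x: float, mu: float = 0.0, s: float = 1.0) -> float:
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def logistic_pdf(x: float, mu: float = 0.0, s: float = 1.0) -> float:
    e = math.exp(-(x - mu) / s)
    return e / (s * (1.0 + e) ** 2)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)

mu, s = 1.0, 2.0  # arbitrary illustration values
for x in (-3.0, 0.0, 1.0, 4.0):
    numeric = central_diff(lambda t: logistic_cdf(t, mu, s), x)
    print(f"x={x:+.1f}  pdf={logistic_pdf(x, mu, s):.8f}  dF/dx={numeric:.8f}")
```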
Logistic function
When $\mu=0$ and $s=1$ , the CDF reduces to the logistic (sigmoid) function that is often used in machine learning:

$\sigma(x) = \frac{1}{1+e^{-x}}$
Approximation for $\Phi(x)$
The CDFs of the logistic and normal distributions have similar shapes: both are S-shaped (and the corresponding densities are both bell-shaped). We can therefore use the logistic CDF, which has a simpler closed form, to approximate the CDF $\Phi(x)$ of the standard normal distribution.
Their means $\mu$ must agree, so both are zero. That leaves only one parameter of the logistic CDF to determine, the scale $s$ . Letting $\gamma = \frac{1}{s}$ , we look for the $\gamma$ that minimizes the maximum difference between the two CDFs:

$\arg\min_{\gamma}\max_x \left| \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{\frac{-t^2}{2}}dt - \frac{1}{1+e^{-\gamma x}} \right|$

Using the generalized reduced gradient algorithm, the parameter $\gamma$ is determined by minimizing the maximum deviation between the normal and logistic CDFs. The maximum deviation of 0.0095 occurs at $x = \pm 0.57$ for $\gamma =1.702$ . Therefore, the following equation gives the best logistic fit for the CDF of standard normal distribution $\Phi(x)$ :

$\Phi(x) \approx F(x) = \frac{1}{1+e^{-1.702x}}$
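
The minimax fit can be reproduced roughly with a brute-force grid search. Note that this swaps a plain grid scan in for the generalized reduced gradient algorithm mentioned above. The grid ranges and step sizes are assumptions chosen for illustration, so the result is only accurate to the grid resolution.

```python
import math

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x: float, gamma: float) -> float:
    return 1.0 / (1.0 + math.exp(-gamma * x))

def max_deviation(gamma: float, xs) -> float:
    """Largest |Phi(x) - F(x; gamma)| over the grid xs."""
    return max(abs(normal_cdf(x) - logistic_cdf(x, gamma)) for x in xs)

xs = [i / 100.0 for i in range(-600, 601)]          # x in [-6, 6], step 0.01
gammas = [1.600 + i / 1000.0 for i in range(201)]   # gamma in [1.6, 1.8], step 0.001

# Pick the gamma whose worst-case deviation over the x grid is smallest.
best_gamma = min(gammas, key=lambda g: max_deviation(g, xs))
dev_at_1702 = max_deviation(1.702, xs)
x_star = max(xs, key=lambda x: abs(normal_cdf(x) - logistic_cdf(x, 1.702)))

print(f"grid-search gamma        : {best_gamma:.3f}")
print(f"max deviation at 1.702   : {dev_at_1702:.4f}")
print(f"location of max deviation: |x| = {abs(x_star):.2f}")
```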