Gaussian Error Linear Unit (GELU)

  • Arxiv link to the paper of GELU
    Gaussian Error Linear Unit (GELU)

  • Definition of GELU

    $${\rm GELU}(x) = xP(X \leq x) = x\Phi(x) = x\cdot \frac{1}{2} \left[ 1+ {\rm erf}(x/\sqrt2) \right]$$

    where $X \sim \mathcal{N}(0, 1)$, $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, and ${\rm erf}(\cdot)$ is the error function.
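    As a quick sanity check of the definition, here is a minimal Python sketch that uses only the standard library's `math.erf`. The helper names `phi_cdf` and `gelu` are illustrative and do not come from the paper.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * phi_cdf(x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu(x):+.6f}")
```

    Because $\Phi(x) + \Phi(-x) = 1$, the function satisfies ${\rm GELU}(x) - {\rm GELU}(-x) = x$, which is a handy identity for testing an implementation.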

  • Derivative of GELU

    $$\begin{split} \frac{d}{dx} {\rm GELU}(x) &= \frac{d}{dx}\left[x\cdot\Phi(x)\right]\\ &= \Phi(x) + x \cdot \frac{d}{dx}\Phi(x)\\ &= \Phi(x) + x \cdot \varphi(x) \end{split}$$

    where $\varphi(x)$ stands for the probability density function of the standard normal distribution.

    Or

    $$\begin{split} \frac{d}{dx} {\rm GELU}(x) &= \frac{d}{dx}\left[\frac{1}{2}x + \frac{1}{2}x\cdot{\rm erf}(x/\sqrt2)\right]\\ &= \frac{1}{2}\left[1 + {\rm erf}(x/\sqrt2) + x \cdot \sqrt\frac{2}{\pi} e^{-x^2/2} \right]\\ &= \frac{1}{2}\left[1 + {\rm erf}(x/\sqrt2) \right] + x \cdot \sqrt\frac{1}{2\pi} e^{-x^2/2} \\ &= \Phi(x) + x \cdot \varphi(x) \end{split}$$
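    The closed form $\Phi(x) + x\varphi(x)$ can be checked against a finite-difference derivative. This is only a numerical sketch: the function names, the step size `h`, and the test grid are arbitrary choices made for illustration.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_pdf(x: float) -> float:
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x: float) -> float:
    return x * phi_cdf(x)

def gelu_grad(x: float) -> float:
    """Analytic derivative: Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Second-order central finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [i / 10.0 for i in range(-50, 51)]  # grid on [-5, 5]
max_gap = max(abs(gelu_grad(x) - central_diff(gelu, x)) for x in xs)
print(f"max |analytic - numeric| = {max_gap:.2e}")
```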

  • Tanh Approximation
    Using the tanh approximation of ${\rm erf}$:

    $$\begin{split} {\rm erf}(x/\sqrt2) &\approx \tanh\left[\frac{2}{\sqrt\pi}\left(\frac{x}{\sqrt2}+\frac{11}{123}\left(\frac{x}{\sqrt2}\right)^3\right)\right]\\ &= \tanh\left[\sqrt\frac{2}{\pi}\left(x + \frac{11}{123}\frac{x^3}{2}\right)\right]\\ &\approx \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right] \end{split}$$

    we obtain the following approximation:

    $${\rm GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)$$
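    To get a feel for how close the tanh form is, the sketch below compares it with the exact GELU on a grid. The grid range $[-6, 6]$ and spacing are assumptions chosen for illustration, so the reported maximum is a grid estimate rather than a proven bound.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation with the cubic coefficient 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 100.0 for i in range(-600, 601)]  # grid on [-6, 6]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh approx| on the grid = {max_err:.2e}")
```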

  • Derivative of Tanh Approximation

    $$\begin{split} &\frac{d}{dx}\left[0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)\right]\\ =& 0.5 + 0.5\tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\\ & + 0.5x\left(\sqrt{2/\pi}\left(1 + 3\times0.044715x^2\right)\right)\left(1-\tanh^2\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)\\ =& 0.5 + 0.5\tanh[\cdots] + \sqrt{\frac{1}{2\pi}}\left(x + 3\times0.044715x^3\right)(1-\tanh^2[\cdots]) \end{split}$$
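    The last line of this derivative translates directly into code. The sketch below (names and grid are illustrative assumptions) evaluates it and compares it with a central finite difference of the tanh approximation.

```python
import math

C = 0.044715                    # cubic coefficient of the tanh approximation
K = math.sqrt(2.0 / math.pi)    # sqrt(2 / pi)

def gelu_tanh(x: float) -> float:
    return 0.5 * x * (1.0 + math.tanh(K * (x + C * x ** 3)))

def gelu_tanh_grad(x: float) -> float:
    """0.5 + 0.5*tanh[...] + sqrt(1/(2*pi)) * (x + 3*C*x^3) * (1 - tanh^2[...])."""
    t = math.tanh(K * (x + C * x ** 3))
    return 0.5 + 0.5 * t + math.sqrt(1.0 / (2.0 * math.pi)) * (x + 3.0 * C * x ** 3) * (1.0 - t * t)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [i / 10.0 for i in range(-50, 51)]  # grid on [-5, 5]
max_gap = max(abs(gelu_tanh_grad(x) - central_diff(gelu_tanh, x)) for x in xs)
print(f"max |analytic - numeric| = {max_gap:.2e}")
```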

  • Logistic Approximation

    $${\rm GELU}(x) = x\Phi(x) \approx x\sigma(1.702x) = \frac{x}{1+e^{-1.702x}}$$

  • Derivative of Logistic Approximation

    $$\frac{d}{dx}\left[\frac{x}{1+e^{-1.702x}}\right] = \frac{1}{1+e^{-1.702x}}+\frac{1.702xe^{-1.702x}}{(1+e^{-1.702x})^2}$$
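    The logistic form is cheaper but coarser than the tanh form. The following sketch, with illustrative names and an arbitrary grid on $[-6, 6]$, measures its gap to the exact GELU and checks the derivative formula with a finite difference.

```python
import math

def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x: float) -> float:
    """Logistic approximation: x * sigma(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))

def gelu_sigmoid_grad(x: float) -> float:
    """Derivative of the logistic approximation."""
    e = math.exp(-1.702 * x)
    return 1.0 / (1.0 + e) + 1.702 * x * e / (1.0 + e) ** 2

def central_diff(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [i / 100.0 for i in range(-600, 601)]  # grid on [-6, 6]
max_err = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
max_gap = max(abs(gelu_sigmoid_grad(x) - central_diff(gelu_sigmoid, x)) for x in xs)
print(f"max |exact - logistic approx| on the grid = {max_err:.4f}")
print(f"max |analytic - numeric derivative|       = {max_gap:.2e}")
```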

Error function

  • Wiki link to Error Function
    Error Function

  • Definition of erf()
    The error function ${\rm erf}(x)$ gives the probability that a normally distributed random variable with mean 0 and variance 1/2 (that is, $\sigma = 1/\sqrt2$) falls in the range $[-x, x]$, for $x \geq 0$. That is:

    $${\rm erf}(x) = \int_{-x}^x \left. \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{t-\mu}{\sigma})^2}dt\right|^{\mu=0}_{\sigma=1/\sqrt2} = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt$$
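    This probabilistic reading can be verified numerically by integrating the $\mathcal{N}(0, 1/2)$ density over $[-x, x]$. The sketch below uses a hand-written composite Simpson rule; the helper names and the number of sub-intervals are assumptions made for illustration.

```python
import math

def normal_pdf(t: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(f, a: float, b: float, n: int = 1000) -> float:
    """Composite Simpson rule; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

sigma = 1.0 / math.sqrt(2.0)  # variance 1/2

def prob_in_range(x: float) -> float:
    """P(-x <= T <= x) for T ~ N(0, 1/2)."""
    return simpson(lambda t: normal_pdf(t, 0.0, sigma), -x, x)

for x in (0.5, 1.0, 2.0):
    print(f"x={x}: integral={prob_in_range(x):.10f}  math.erf={math.erf(x):.10f}")
```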

  • Relationship between ${\rm erf}$ and $\Phi$
    $\Phi$ denotes the cumulative distribution function of the standard normal distribution.

    $$\begin{split} \Phi(x) &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}dt\\ &= \frac{1}{\sqrt{\pi}}\int_{-\infty}^{x/\sqrt2} e^{-s^2}ds \qquad \left(s=\frac{t}{\sqrt2}\right)\\ &= \frac{1}{\sqrt{\pi}}\int_{-\infty}^0 e^{-s^2}ds + \frac{1}{\sqrt{\pi}}\int_{0}^{x/\sqrt2} e^{-s^2}ds\\ &= \frac{1}{2} + \frac{1}{2}{\rm erf}\left(\frac{x}{\sqrt2}\right) \end{split}$$

  • Derivative

    $$\frac{d}{dx}{\rm erf}(x) = \frac{d}{dx}\left(\frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt\right) = \frac{2}{\sqrt\pi} e^{-x^2}$$

  • Numerical Approximation
    Wikipedia: Error function, section "Approximation with elementary functions"

    $${\rm erf}(x) \approx \tanh\left[\frac{2}{\sqrt\pi}\left(x+\frac{11}{123}x^3\right)\right]$$


  • How to find the tanh approximation for erf
    The tanh approximation of the error function can be derived by using Taylor expansion of these two functions.

    $$\begin{split} \tanh(x) &= \sum_{n=1}^\infty \frac{2^{2n}(2^{2n}-1)B_{2n}x^{2n-1}}{(2n)!},\quad |x|<\frac{\pi}{2}\\ &= x - \frac{x^3}{3} + \frac{2x^5}{15} - \frac{17x^7}{315} + \cdots\\ {\rm erf}(x) &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n+1}}{n!(2n+1)}\\ &= \frac{2}{\sqrt\pi}\left(x-\frac{x^3}{3}+\frac{x^5}{10}-\frac{x^7}{42}+\frac{x^9}{216}-\cdots\right) \end{split}$$

    Suppose we look for an approximation of the form

    $$z(x) = \tanh(a_0 + a_1x + a_2x^2 + a_3x^3) \approx {\rm erf}(x)$$

    First, erf is an odd function, so $z(x)$ should also be odd in order to approximate it. Therefore $a_0=a_2=0$, and

    $$z(x) = \tanh(a_1x + a_3x^3)$$

    Use the Taylor expansion of tanh to expand $z(x)$:

    $$\tanh(a_1x + a_3x^3) = \fbox{$ \begin{array}{lclclclcl} a_1x &+& a_3x^3\\ &-& \frac{1}{3}a_1^3x^3 &-& a_1^2a_3x^5 &-& a_1a_3^2x^7 &-& \frac{1}{3}a_3^3x^9\\ & & &+& \frac{2}{15}a_1^5x^5 &+& \cdots\\ & & & & & - & \cdots \end{array}$}$$

    Comparing the coefficients of $x$ and $x^3$ in the two expansions gives

    $$\begin{cases} a_1 = \frac{2}{\sqrt\pi}\\ a_3 - \frac{1}{3}a_1^3 = -\frac{2}{3\sqrt\pi} \end{cases} \Rightarrow \begin{cases} a_1 = \frac{2}{\sqrt\pi}\\ a_3 = \frac{2}{\sqrt\pi}\frac{(4-\pi)}{3\pi} \end{cases}$$
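    The algebra for the cubic coefficient is easy to get wrong, so here is a small numerical check. It solves the second equation for $a_3$ and compares the result with the closed form $\frac{2}{\sqrt\pi}\frac{4-\pi}{3\pi}$; the variable names are illustrative.

```python
import math

a1 = 2.0 / math.sqrt(math.pi)

# Solve a3 - a1^3 / 3 = -2 / (3 sqrt(pi)) for a3.
a3 = a1 ** 3 / 3.0 - 2.0 / (3.0 * math.sqrt(math.pi))

# Closed form: (2 / sqrt(pi)) * (4 - pi) / (3 pi).
closed_form = a1 * (4.0 - math.pi) / (3.0 * math.pi)

print(f"a1          = {a1:.6f}")
print(f"a3          = {a3:.6f}")
print(f"closed form = {closed_form:.6f}")
print(f"a3 / a1     = {a3 / a1:.6f}")  # the factor multiplying x^3 inside the parentheses
```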

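    We get a formula of this form. Note that the version below uses the cubic factor $\frac{\pi}{35} \approx 0.0898$, which is close to but not equal to the Taylor-matched value $\frac{4-\pi}{3\pi} \approx 0.0911$: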

    $$z(x) = \tanh\left[\frac{2}{\sqrt\pi}\left(x+\frac{\pi}{35}x^3\right)\right]$$

    which keeps the relative difference $\left\vert 1-\frac{{\rm erf}(x)}{z(x)}\right\vert < 0.0005$.

    However, we are not done yet. You may have noticed that the cubic coefficient $a_3$ is $\frac{2}{\sqrt\pi}\frac{11}{123}$ on Wikipedia rather than the $\frac{2}{\sqrt\pi}\frac{(4-\pi)}{3\pi}$ we calculated. The explanation is here:
    How to minimize the maximum absolute difference between 2 functions?
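    A rough numerical comparison illustrates the difference between matching Taylor coefficients and minimizing the worst-case error. The sketch below evaluates the maximum absolute error of $\tanh\left[\frac{2}{\sqrt\pi}(x + c\,x^3)\right]$ against `math.erf` for three values of $c$. The grid on $[0, 4]$ and the dictionary keys are assumptions for illustration, and this brute-force check is not the method used in the linked answer.

```python
import math

def erf_tanh(x: float, c: float) -> float:
    """tanh[(2 / sqrt(pi)) * (x + c * x^3)] as an approximation of erf(x)."""
    return math.tanh(2.0 / math.sqrt(math.pi) * (x + c * x ** 3))

def max_abs_error(c: float, x_max: float = 4.0, steps: int = 4000) -> float:
    """Largest |erf(x) - approximation| on a uniform grid over [0, x_max]."""
    grid = (x_max * i / steps for i in range(steps + 1))
    return max(abs(math.erf(x) - erf_tanh(x, c)) for x in grid)

candidates = {
    "taylor": (4.0 - math.pi) / (3.0 * math.pi),  # from matching the x^3 coefficient
    "pi/35": math.pi / 35.0,
    "11/123": 11.0 / 123.0,                       # value quoted from Wikipedia
}
errors = {name: max_abs_error(c) for name, c in candidates.items()}
for name, c in candidates.items():
    print(f"{name:>7}: c = {c:.6f}, max abs error = {errors[name]:.6f}")
```

    Both functions are odd, so checking $x \geq 0$ is enough. The Taylor-matched coefficient is the most accurate near $x = 0$ but has a larger worst-case error than $\frac{11}{123}$ on this grid.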


  • Derivation for erf’s Taylor series

    $$\frac{d}{dx}{\rm erf}(x) = \frac{2}{\sqrt\pi} e^{-x^2} = \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n}}{n!}$$

    which follows from substituting $-x^2$ for $x$ in the exponential series

    $$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!},\quad \text{for } x \in \mathbb{R}$$

    Take the antiderivative of both sides with respect to $x$:

    $$\begin{split} {\rm erf}(x) &= \frac{2}{\sqrt\pi}\int\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n}}{n!}\,dx\\ &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\int\frac{(-1)^n x^{2n}}{n!}\,dx \qquad \text{(uniform convergence)}\\ &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\left[\frac{(-1)^n x^{2n+1}}{n!(2n+1)}+c_n\right]\\ &= \frac{2}{\sqrt\pi}\left(x-\frac{x^3}{3}+\frac{x^5}{10}-\frac{x^7}{42}+\frac{x^9}{216}-\cdots\right) + \frac{2}{\sqrt\pi}\sum_{n=0}^\infty c_n \end{split}$$

    Setting $x=0$ and using ${\rm erf}(0)=0$, we conclude that the constant term $\sum_{n=0}^\infty c_n$ must be 0. This gives the Taylor expansion of the error function:

    $$\begin{split} {\rm erf}(x) &= \frac{2}{\sqrt\pi}\sum_{n=0}^{\infty}\frac{(-1)^n x^{2n+1}}{n!(2n+1)}\\ &= \frac{2}{\sqrt\pi}\left(x-\frac{x^3}{3}+\frac{x^5}{10}-\frac{x^7}{42}+\frac{x^9}{216}-\cdots\right) \end{split}$$
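    The series can be summed directly to see how quickly it converges for moderate $x$. This is a minimal sketch: the function name `erf_taylor` and the chosen numbers of terms are illustrative, and for large $|x|$ the alternating terms grow before they shrink, so many more terms would be needed.

```python
import math

def erf_taylor(x: float, terms: int = 20) -> float:
    """Partial sum of (2 / sqrt(pi)) * sum_n (-1)^n x^(2n+1) / (n! (2n+1))."""
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * x ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
    return 2.0 / math.sqrt(math.pi) * total

for x in (0.5, 1.0, 1.5):
    for terms in (5, 10, 20):
        err = abs(erf_taylor(x, terms) - math.erf(x))
        print(f"x={x}, terms={terms:2d}: |partial sum - math.erf| = {err:.2e}")
```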

Logistic distribution

  • Wiki to Logistic distribution
    Logistic distribution

  • Probability density function

    $$f(x;\mu,s) = \frac{e^{-\frac{x-\mu}{s}}}{s(1+e^{-\frac{x-\mu}{s}})^2}$$

  • Cumulative distribution function

    $$F(x;\mu,s) = \frac{1}{1+e^{-\frac{x-\mu}{s}}}$$

  • Logistic function
    When $\mu=0$ and $s=1$, the CDF becomes the logistic function that is often used in machine learning:

    $$f(x) = \frac{1}{1+e^{-x}}$$
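    The density and CDF above are related by differentiation, which the following sketch checks numerically. The function names, default arguments, and the test parameters $\mu = 1$, $s = 2$ are illustrative assumptions.

```python
import math

def logistic_cdf(x: float, mu: float = 0.0, s: float = 1.0) -> float:
    """CDF of the logistic distribution; the logistic function when mu=0, s=1."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def logistic_pdf(x: float, mu: float = 0.0, s: float = 1.0) -> float:
    """PDF of the logistic distribution."""
    e = math.exp(-(x - mu) / s)
    return e / (s * (1.0 + e) ** 2)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)

mu, s = 1.0, 2.0
xs = [i / 10.0 for i in range(-50, 51)]
max_gap = max(
    abs(logistic_pdf(x, mu, s) - central_diff(lambda t: logistic_cdf(t, mu, s), x))
    for x in xs
)
print(f"max |pdf - d/dx cdf| = {max_gap:.2e}")
```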

  • Approximation for Φ(x)\Phi(x)
    The CDFs of the logistic and normal distributions have similar shapes: both are S-shaped, and the corresponding densities are both bell-shaped. We can therefore use the logistic CDF, which has a simpler formula, to approximate the CDF $\Phi(x)$ of the standard normal distribution.
    Their means $\mu$ should match, so both are zero. That leaves only one parameter of the logistic CDF to determine: the scale $s$. Let $\gamma = \frac{1}{s}$; we look for the $\gamma$ that minimizes the maximum difference between the two CDFs:

    $$\arg\min_{\gamma}\max_x \left| \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}dt - \frac{1}{1+e^{-\gamma x}} \right|$$

    Using the generalized reduced gradient algorithm, the parameter $\gamma$ is determined by minimizing the maximum deviation between the normal and logistic CDFs. For $\gamma = 1.702$, the maximum deviation of 0.0095 occurs at $x = \pm 0.57$. Therefore, the following equation gives the best logistic fit, in this minimax sense, for the CDF of the standard normal distribution $\Phi(x)$:

    $$\Phi(x) \approx F(x) = \frac{1}{1+e^{-1.702x}}$$
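    The minimax problem can be reproduced approximately with a brute-force grid search. Note that this sketch swaps in a plain grid search for the generalized reduced gradient algorithm mentioned above; the search ranges and step sizes are assumptions, so the result is only accurate to the grid resolution.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x: float, gamma: float) -> float:
    """Logistic CDF with mean 0 and scale 1 / gamma."""
    return 1.0 / (1.0 + math.exp(-gamma * x))

# Both CDFs are symmetric about (0, 0.5), so searching x >= 0 is enough.
xs = [i / 100.0 for i in range(0, 501)]            # x in [0, 5], step 0.01

def max_deviation(gamma: float) -> float:
    return max(abs(phi_cdf(x) - logistic_cdf(x, gamma)) for x in xs)

gammas = [1.6 + i / 1000.0 for i in range(0, 201)]  # gamma in [1.6, 1.8], step 0.001
best_gamma = min(gammas, key=max_deviation)
best_dev = max_deviation(best_gamma)
print(f"best gamma on the grid = {best_gamma:.3f}, max deviation = {best_dev:.4f}")
```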