Gaussian Error Linear Unit (GELU)
arXiv link to the GELU paper: Gaussian Error Linear Unit (GELU)
Definition of GELU
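$$\operatorname{GELU}(x) = x\,P(X \le x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$
where $X \sim \mathcal{N}(0, 1)$, $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, and $\operatorname{erf}(\cdot)$ is the error function.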
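The definition translates almost directly into code. The following is a minimal sketch using only Python's standard library; the function name `gelu` is chosen here for illustration and is not taken from the paper.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    standard_normal_cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * standard_normal_cdf


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu(x):+.6f}")
```

For large positive inputs the output approaches `x`, and for large negative inputs it approaches 0, which matches the shape of $x\,\Phi(x)$.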
Derivative of GELU
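$$\frac{d}{dx}\operatorname{GELU}(x) = \frac{d}{dx}\left[x \cdot \Phi(x)\right] = \Phi(x) + x \cdot \frac{d}{dx}\Phi(x) = \Phi(x) + x \cdot \varphi(x)$$
where $\varphi(x)$ stands for the probability density function (PDF) of the standard normal distribution.
Or, differentiating the erf form directly:
$$\begin{aligned}
\frac{d}{dx}\operatorname{GELU}(x) &= \frac{d}{dx}\left[\frac{1}{2}x + \frac{1}{2}x \cdot \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] \\
&= \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right) + x \cdot \sqrt{\frac{2}{\pi}}\, e^{-x^2/2}\right] \\
&= \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] + x \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \\
&= \Phi(x) + x \cdot \varphi(x)
\end{aligned}$$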
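A quick way to sanity-check the closed form $\Phi(x) + x\,\varphi(x)$ is to compare it with a central finite difference. This is only a numerical sketch; the helper names and the step size `h` are assumptions made for the example.

```python
import math


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def gelu(x: float) -> float:
    return x * std_normal_cdf(x)


def gelu_grad(x: float) -> float:
    """Analytic derivative: Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    analytic = gelu_grad(x)
    numeric = central_difference(gelu, x)
    print(f"x={x:+.1f}  analytic={analytic:.8f}  numeric={numeric:.8f}")
```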
Tanh Approximation
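By the tanh approximation of $\operatorname{erf}$ (see Numerical Approximation in the Error function section),
$$\operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right) \approx \tanh\!\left[\frac{2}{\sqrt{\pi}}\left(\frac{x}{\sqrt{2}} + \frac{11}{123}\left(\frac{x}{\sqrt{2}}\right)^3\right)\right] = \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(x + \frac{11}{123}\cdot\frac{1}{2}x^3\right)\right] \approx \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]$$
we obtain the following approximation:
$$\operatorname{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)$$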
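To get a feel for how close the tanh form is, one can scan a grid and record the largest absolute gap to the exact GELU. The grid range $[-6, 6]$ and spacing are arbitrary choices for this sketch, and the function names are illustrative.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation with the 0.044715 cubic coefficient."""
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


xs = [i / 1000.0 for i in range(-6000, 6001)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh approx| on [-6, 6]: {max_err:.6f}")
```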
Derivative of Tanh Approximation
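$$\begin{aligned}
& \frac{d}{dx}\left[0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)\right] \\
=\; & 0.5 + 0.5\tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right] + 0.5x\left(\sqrt{2/\pi}\left(1 + 3 \times 0.044715x^2\right)\right)\left(1 - \tanh^2\!\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right) \\
=\; & 0.5 + 0.5\tanh[\cdots] + \frac{1}{\sqrt{2\pi}}\left(x + 3 \times 0.044715x^3\right)\left(1 - \tanh^2[\cdots]\right)
\end{aligned}$$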
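The last line of this derivative can be checked numerically in the same spirit. The sketch below implements it as written and compares it with a central difference; the names `K`, `C`, and the step size are assumptions for illustration.

```python
import math

C = math.sqrt(2.0 / math.pi)  # sqrt(2/pi)
K = 0.044715                  # cubic coefficient of the tanh approximation


def gelu_tanh(x: float) -> float:
    return 0.5 * x * (1.0 + math.tanh(C * (x + K * x ** 3)))


def gelu_tanh_grad(x: float) -> float:
    """0.5 + 0.5*tanh[...] + (x + 3*K*x^3) * (1 - tanh^2[...]) / sqrt(2*pi)."""
    t = math.tanh(C * (x + K * x ** 3))
    return 0.5 + 0.5 * t + (x + 3.0 * K * x ** 3) * (1.0 - t * t) / math.sqrt(2.0 * math.pi)


def central_difference(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  analytic={gelu_tanh_grad(x):.8f}  "
          f"numeric={central_difference(gelu_tanh, x):.8f}")
```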
Logistic Approximation
$$\operatorname{GELU}(x) = x\,\Phi(x) \approx x\,\sigma(1.702x) = \frac{x}{1 + e^{-1.702x}}$$
Derivative of Logistic Approximation
$$\frac{d}{dx}\left[\frac{x}{1 + e^{-1.702x}}\right] = \frac{1}{1 + e^{-1.702x}} + \frac{1.702\,x\,e^{-1.702x}}{\left(1 + e^{-1.702x}\right)^2}$$
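The logistic form is cheaper but coarser than the tanh form. The sketch below, with illustrative function names and an arbitrary grid on $[-6, 6]$, measures its largest gap to the exact GELU and checks the derivative formula against a finite difference.

```python
import math


def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_logistic(x: float) -> float:
    """x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def gelu_logistic_grad(x: float) -> float:
    e = math.exp(-1.702 * x)
    return 1.0 / (1.0 + e) + 1.702 * x * e / (1.0 + e) ** 2


def central_difference(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2.0 * h)


xs = [i / 1000.0 for i in range(-6000, 6001)]
max_err = max(abs(gelu_exact(x) - gelu_logistic(x)) for x in xs)
print(f"max |exact - logistic approx| on [-6, 6]: {max_err:.5f}")

for x in (-2.0, 0.5, 2.0):
    print(f"x={x:+.1f}  analytic={gelu_logistic_grad(x):.8f}  "
          f"numeric={central_difference(gelu_logistic, x):.8f}")
```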
Error function
Wikipedia link: Error Function
Definition of erf()
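The error function $\operatorname{erf}(x)$ gives the probability that a random variable, normally distributed with mean 0 and variance 1/2, falls in the range $[-x, x]$ (for $x \ge 0$). That is:
$$\operatorname{erf}(x) = \int_{-x}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{t - \mu}{\sigma}\right)^2} dt \;\overset{\sigma = 1/\sqrt{2},\ \mu = 0}{=}\; \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt$$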
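The probabilistic reading can be verified numerically: integrate the $\mathcal{N}(0, 1/2)$ density over $[-x, x]$ and compare with `math.erf`. The sketch uses a hand-written composite Simpson rule; the helper names and the number of subintervals are assumptions for this example.

```python
import math


def normal_pdf(t: float, mu: float, sigma: float) -> float:
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def simpson(f, a: float, b: float, n: int = 1000) -> float:
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


def erf_as_probability(x: float) -> float:
    """P(-x <= T <= x) for T ~ N(0, 1/2), i.e. sigma = 1/sqrt(2)."""
    sigma = 1.0 / math.sqrt(2.0)
    return simpson(lambda t: normal_pdf(t, 0.0, sigma), -x, x)


for x in (0.5, 1.0, 2.0):
    print(f"x={x}: integral={erf_as_probability(x):.10f}  math.erf={math.erf(x):.10f}")
```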
Relationship between erf and Φ
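$\Phi$ denotes the cumulative distribution function of the standard normal distribution. Substituting $s = t/\sqrt{2}$:
$$\begin{aligned}
\Phi(x) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt \\
&= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{x/\sqrt{2}} e^{-s^2}\, ds \\
&= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{0} e^{-s^2}\, ds + \frac{1}{\sqrt{\pi}} \int_{0}^{x/\sqrt{2}} e^{-s^2}\, ds \\
&= \frac{1}{2} + \frac{1}{2}\operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)
\end{aligned}$$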
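As a cross-check of this identity, the sketch below compares $\frac{1}{2} + \frac{1}{2}\operatorname{erf}(x/\sqrt{2})$ with the CDF provided by `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `phi_via_erf` is chosen here for illustration.

```python
import math
from statistics import NormalDist


def phi_via_erf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))


standard_normal = NormalDist(mu=0.0, sigma=1.0)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  via erf={phi_via_erf(x):.12f}  "
          f"NormalDist.cdf={standard_normal.cdf(x):.12f}")
```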
Derivative
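$$\frac{d}{dx}\operatorname{erf}(x) = \frac{d}{dx}\left(\frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt\right) = \frac{2}{\sqrt{\pi}}\, e^{-x^2}$$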
Numerical Approximation
See Error_function#Approximation_with_elementary_functions on Wikipedia:
$$\operatorname{erf}(x) \approx \tanh\!\left[\frac{2}{\sqrt{\pi}}\left(x + \frac{11}{123}x^3\right)\right]$$
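A short scan shows how tightly this tanh form follows `math.erf`. The grid on $[-5, 5]$ is an arbitrary choice for this sketch, and `erf_tanh` is an illustrative name.

```python
import math


def erf_tanh(x: float) -> float:
    """tanh[(2/sqrt(pi)) * (x + (11/123) * x^3)]."""
    return math.tanh(2.0 / math.sqrt(math.pi) * (x + 11.0 / 123.0 * x ** 3))


xs = [i / 1000.0 for i in range(-5000, 5001)]
max_err = max(abs(math.erf(x) - erf_tanh(x)) for x in xs)
print(f"max |erf - tanh approx| on [-5, 5]: {max_err:.6f}")
```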
How to find the tanh approximation for erf
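The tanh approximation of the error function can be derived by comparing the Taylor expansions of the two functions:
$$\begin{aligned}
\tanh(x) &= \sum_{n=1}^{\infty} \frac{2^{2n}\left(2^{2n} - 1\right)B_{2n}}{(2n)!}\, x^{2n-1}, \quad |x| < \frac{\pi}{2} \\
&= x - \frac{x^3}{3} + \frac{2x^5}{15} - \frac{17x^7}{315} + \cdots \\
\operatorname{erf}(x) &= \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)} \\
&= \frac{2}{\sqrt{\pi}}\left(x - \frac{x^3}{3} + \frac{x^5}{10} - \frac{x^7}{42} + \frac{x^9}{216} - \cdots\right)
\end{aligned}$$
Let's say
$$z(x) = \tanh\left(a_0 + a_1x + a_2x^2 + a_3x^3\right) \approx \operatorname{erf}(x)$$
Firstly, $\operatorname{erf}$ is an odd function, so $z(x)$ should also be an odd function to approximate it. Therefore, we have $a_0 = a_2 = 0$:
$$z(x) = \tanh\left(a_1x + a_3x^3\right)$$
Use the Taylor expansion of $\tanh$ to expand $z(x)$:
$$\begin{aligned}
\tanh\left(a_1x + a_3x^3\right) &= a_1x + a_3x^3 - \frac{1}{3}a_1^3x^3 - a_1^2a_3x^5 + \frac{2}{15}a_1^5x^5 - a_1a_3^2x^7 + \cdots \\
&= a_1x + \left(a_3 - \frac{1}{3}a_1^3\right)x^3 + \cdots
\end{aligned}$$
Compare the coefficients of $x$ and $x^3$ in the two expansions:
$$\begin{cases} a_1 = \dfrac{2}{\sqrt{\pi}} \\[2mm] a_3 - \dfrac{1}{3}a_1^3 = -\dfrac{2}{3\sqrt{\pi}} \end{cases} \;\Rightarrow\; \begin{cases} a_1 = \dfrac{2}{\sqrt{\pi}} \\[2mm] a_3 = \dfrac{2}{\sqrt{\pi}} \cdot \dfrac{4 - \pi}{3\pi} \end{cases}$$
We get this formula: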
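$$z(x) = \tanh\!\left[\frac{2}{\sqrt{\pi}}\left(x + \frac{4 - \pi}{3\pi}x^3\right)\right]$$
which keeps the relative difference $\left|1 - \frac{\operatorname{erf}(x)}{z(x)}\right|$ small, on the order of $5 \times 10^{-4}$.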
However, we are not done yet. You may have noticed that $a_3$ is $\frac{2}{\sqrt{\pi}} \cdot \frac{11}{123}$ on the wiki rather than the $\frac{2}{\sqrt{\pi}} \cdot \frac{4 - \pi}{3\pi}$ we calculated. The answer is here:
How to minimize the maximum absolute difference between 2 functions?
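The difference between the two cubic coefficients can be seen numerically. The sketch below does not reproduce the linked minimax derivation; it only compares the Taylor-matched coefficient $\frac{4-\pi}{3\pi}$ with the wiki value $\frac{11}{123}$ on an arbitrary grid, using illustrative helper names.

```python
import math

A1 = 2.0 / math.sqrt(math.pi)


def tanh_fit(x: float, c: float) -> float:
    """tanh[A1 * (x + c * x^3)] for a given cubic coefficient c."""
    return math.tanh(A1 * (x + c * x ** 3))


def max_abs_error(c: float) -> float:
    xs = [i / 1000.0 for i in range(0, 5001)]  # erf and the fit are odd, so x >= 0 suffices
    return max(abs(math.erf(x) - tanh_fit(x, c)) for x in xs)


c_taylor = (4.0 - math.pi) / (3.0 * math.pi)  # from matching Taylor coefficients
c_wiki = 11.0 / 123.0                         # value quoted on the wiki

print(f"Taylor-matched c={c_taylor:.6f}: max error {max_abs_error(c_taylor):.6f}")
print(f"wiki           c={c_wiki:.6f}: max error {max_abs_error(c_wiki):.6f}")

# Near the origin the Taylor-matched coefficient is the more accurate one.
x_small = 0.1
print(f"error at x={x_small}: Taylor {abs(math.erf(x_small) - tanh_fit(x_small, c_taylor)):.2e}, "
      f"wiki {abs(math.erf(x_small) - tanh_fit(x_small, c_wiki)):.2e}")
```

Matching Taylor coefficients optimizes accuracy near $x = 0$, while the wiki coefficient trades a little local accuracy for a smaller worst-case error over the whole axis.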
Derivation for erf’s Taylor series
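Using the Taylor expansion $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$, valid for $x \in \mathbb{R}$:
$$\frac{d}{dx}\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\, e^{-x^2} = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{n!}$$
Take the antiderivative of both sides with respect to $x$; uniform convergence allows the sum and the integral to be exchanged:
$$\begin{aligned}
\operatorname{erf}(x) &= \frac{2}{\sqrt{\pi}} \int \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{n!}\, dx \\
&= \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \int \frac{(-1)^n x^{2n}}{n!}\, dx \\
&= \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)} + \sum_{n=0}^{\infty} c_n \\
&= \frac{2}{\sqrt{\pi}}\left(x - \frac{x^3}{3} + \frac{x^5}{10} - \frac{x^7}{42} + \frac{x^9}{216} - \cdots\right) + \sum_{n=0}^{\infty} c_n
\end{aligned}$$
Let $x = 0$: since $\operatorname{erf}(0) = 0$, we conclude that the constant term $\sum_{n=0}^{\infty} c_n$ should be 0. We get the Taylor expansion of the error function as follows:
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)} = \frac{2}{\sqrt{\pi}}\left(x - \frac{x^3}{3} + \frac{x^5}{10} - \frac{x^7}{42} + \frac{x^9}{216} - \cdots\right)$$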
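The series can be evaluated directly by summing a finite number of terms. The sketch below is a plain partial sum with an assumed default of 30 terms; it is meant for moderate $|x|$ only, since the alternating terms cancel badly for large arguments.

```python
import math


def erf_taylor(x: float, terms: int = 30) -> float:
    """Partial sum of (2/sqrt(pi)) * sum_n (-1)^n x^(2n+1) / (n! (2n+1))."""
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * x ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
    return 2.0 / math.sqrt(math.pi) * total


for x in (0.1, 0.5, 1.0, 2.0):
    print(f"x={x}: series={erf_taylor(x):.12f}  math.erf={math.erf(x):.12f}")
```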
Logistic distribution
Wikipedia link: Logistic distribution
Probability density function
$$f(x; \mu, s) = \frac{e^{-\frac{x - \mu}{s}}}{s\left(1 + e^{-\frac{x - \mu}{s}}\right)^2}$$
Cumulative distribution function
$$F(x; \mu, s) = \frac{1}{1 + e^{-\frac{x - \mu}{s}}}$$
Logistic function
When $\mu = 0$ and $s = 1$, the CDF becomes the logistic function that is often used in machine learning:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Approximation for Φ(x)
The CDFs of the logistic and normal distributions have similar shapes: both are S-shaped (and the corresponding densities are both bell shaped). We can use the logistic CDF, which has a simpler formula for calculation, to approximate the CDF $\Phi(x)$ of the standard normal distribution.
Their location parameters $\mu$ should be the same, both zero. Thus, only one parameter of the logistic CDF is left to determine: the scale $s$. Let $\gamma = \frac{1}{s}$; we try to find the $\gamma$ that minimizes the maximum difference between the two CDFs:
$$\arg\min_{\gamma} \max_{x} \left| \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt - \frac{1}{1 + e^{-\gamma x}} \right|$$
Using the generalized reduced gradient algorithm, the parameter $\gamma$ is determined by minimizing the maximum deviation between the normal and logistic CDFs. The maximum deviation of 0.0095 occurs at $x = \pm 0.57$ for $\gamma = 1.702$. Therefore, the following equation gives the best logistic fit for the CDF of the standard normal distribution $\Phi(x)$:
$$\Phi(x) \approx F(x) = \frac{1}{1 + e^{-1.702x}}$$
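The minimax fit can be reproduced approximately with a brute-force search. Note that this sketch swaps the generalized reduced gradient algorithm for a plain grid search over $\gamma$; the search range $[1.6, 1.8]$, both grid spacings, and all helper names are assumptions made for the example.

```python
import math


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logistic_cdf(x: float, gamma: float) -> float:
    return 1.0 / (1.0 + math.exp(-gamma * x))


# Both CDFs are symmetric about (0, 0.5), so scanning x >= 0 is enough.
XS = [i / 100.0 for i in range(0, 601)]
PHI = [normal_cdf(x) for x in XS]


def max_deviation(gamma: float) -> float:
    """Largest absolute gap between the two CDFs on the x grid."""
    return max(abs(p - logistic_cdf(x, gamma)) for x, p in zip(XS, PHI))


gammas = [1.6 + i / 1000.0 for i in range(201)]
best_gamma = min(gammas, key=max_deviation)

print(f"grid-search gamma: {best_gamma:.3f}")
print(f"max deviation at best gamma: {max_deviation(best_gamma):.4f}")
print(f"max deviation at gamma=1.702: {max_deviation(1.702):.4f}")
```

Because the search is discretized in both $\gamma$ and $x$, the result should only be expected to land near the quoted $\gamma = 1.702$, not to match it digit for digit.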