Self-Supervised Learning

Self-Supervised Representation Learning | Lil’Log

A self-supervised task is also known as a pretext task.
Broadly speaking, all generative models can be considered self-supervised, but with different goals: generative models focus on creating diverse and realistic images, while self-supervised representation learning cares about producing good features that are generally helpful for many tasks.

RGB-IR

Remote sensing images (RSI) typically come in three color modes:

  • IRRG: 3 channels (IR-R-G)
  • RGB: 3 channels (R-G-B)
  • RGBIR: 4 channels (R-G-B-IR)

where RGB stands for Red-Green-Blue and IR stands for InfraRed.

Datasets

ISPRS, the International Society for Photogrammetry and Remote Sensing, is a non-governmental organization devoted to advancing the development, application, and international cooperation of photogrammetry and remote sensing.
Its ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling includes two well-known datasets for remote sensing image segmentation: the Potsdam and Vaihingen datasets.

Note: the ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling page provides download links for the datasets.

Potsdam

Vaihingen

Note: the Vaihingen dataset contains 33 patches in total; the patches numbered 9, 18, 19, 25, and 36 are missing.

About GLCNet

global style and local matching contrastive learning network (GLCNet)

  • Global style contrastive learning module
  • Local matching contrastive learning module

Network architecture

Input: $x_i$

Data augmentation:

$$\tilde{x}_i = t_1(x_i),\quad \hat{x}_i = t_2(x_i)$$

$t_1(\cdot)$ represents random cropping followed by resizing to a fixed resolution (e.g. $224 \times 224$).
$t_2(\cdot)$ represents sequentially applying several augmentations: random cropping followed by resizing to a fixed resolution, random flipping, random rotation, random color distortion, random Gaussian blur, random noise, and random grayscale.
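
As a rough illustration, the two pipelines could be written with torchvision as below; the operations follow the list above, but the parameter values (crop size, jitter strength, blur kernel, noise scale) are assumptions rather than the paper's settings.

```python
import torch
import torchvision.transforms as T

# t1: random crop followed by resizing to a fixed resolution (assumed 224x224)
t1 = T.Compose([
    T.RandomResizedCrop(224),
    T.ToTensor(),
])

# t2: crop/resize, then flipping, rotation, color distortion, Gaussian blur,
# noise and grayscale; all probabilities/strengths here are illustrative
t2 = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=90),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # additive Gaussian noise
])
```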

Feature extraction:

$$\tilde{f}_i = \mu(e(\tilde{x}_i)),\quad \hat{f}_i = \mu(e(\hat{x}_i))$$

$e(\cdot)$ is the encoder of the semantic segmentation network DeepLabV3+.
$\mu(\cdot)$ represents the calculation of the mean value of each channel in the feature map (i.e. global average pooling).

Encoder

Projection head:

$$\tilde{z}_i = g(\tilde{f}_i) = \tilde{W}^{(2)}\cdot{\rm ReLU}\left(\tilde{W}^{(1)}\cdot\tilde{f}_i\right),\qquad \hat{z}_i = g(\hat{f}_i) = \hat{W}^{(2)}\cdot{\rm ReLU}\left(\hat{W}^{(1)}\cdot\hat{f}_i\right)$$

Open question: is $\hat{W} = \tilde{W}$?

The projection head $g(\cdot)$ is an MLP with one hidden layer (with ReLU).

The presence of $g(\cdot)$ in SimCLR has been shown to be very beneficial, possibly because it allows $e(\cdot)$ to form and retain more information potentially useful for downstream tasks.
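
A minimal sketch of $e(\cdot)$, $\mu(\cdot)$, and $g(\cdot)$ in PyTorch; the ResNet-50 backbone and the dimensions (2048 hidden units, 128-d output) are assumptions standing in for the actual DeepLabV3+ encoder and the paper's settings.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Stand-in for e(.): a ResNet backbone returning a spatial feature map."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc

    def forward(self, x):
        return self.backbone(x)            # (B, 2048, H/32, W/32)

class ProjectionHead(nn.Module):
    """g(.): an MLP with one hidden layer and ReLU."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, f):
        return self.net(f)

e, g = Encoder(), ProjectionHead()
x_tilde = torch.randn(4, 3, 224, 224)      # a batch of augmented views
f_tilde = e(x_tilde).mean(dim=(2, 3))      # mu(.): global average pooling -> (B, 2048)
z_tilde = g(f_tilde)                       # (B, 128) projection
```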

Contrastive loss:

$$L_c = \frac{1}{2N}\sum_{i=1}^N\left(l(\tilde{x}_i,\hat{x}_i)+l(\hat{x}_i,\tilde{x}_i)\right)$$

$$l(\tilde{x}_i, \hat{x}_i) = -\log\frac{\exp(\mathrm{sim}(\tilde{z}_i, \hat{z}_i)/\tau)}{\sum_{x\in\Lambda^-} \exp(\mathrm{sim}(\tilde{z}_i, \underline{g(f(x))})/\tau)}$$

$\mathrm{sim}(\cdot)$ denotes the similarity measure between two feature vectors; in this work it is the cosine similarity.
$\Lambda^-$ denotes the $2(N-1)$ negative samples in addition to the positive sample pair.
$\tau$ denotes a temperature parameter.
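
A compact sketch of this loss for a mini-batch of paired projections (the general NT-Xent form, not necessarily the authors' exact implementation); `z1` and `z2` are the projections of the two augmented views.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (N, d) projections of the two views; returns the averaged loss."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                # cosine similarities / tau
    n = z1.shape[0]
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self terms
    # the positive of sample i is i + N, and vice versa
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)                 # averages over all 2N anchors
```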

GLCNet:

Global style contrastive learning module
We calculate the ==channel-wise mean== and ==variance of the features extracted by the encoder $e(\cdot)$== to extract the global style feature vector.

$$f^s(x_i) = \mathrm{concat}\left(\mu(e(x_i)), \sigma(e(x_i))\right)$$

$\mu(e(x_i))$: channel-wise mean of the feature map.
$\sigma(e(x_i))$: variance of the features extracted by the encoder $e(\cdot)$.

The global style contrastive learning loss is defined as follows:

$$L_G = \frac{1}{2N}\sum_{i=1}^N\left(l_g(\tilde{x}_i,\hat{x}_i)+l_g(\hat{x}_i,\tilde{x}_i)\right)$$

$$l_g(\tilde{x}_i, \hat{x}_i) = -\log\frac{\exp(\mathrm{sim}(\tilde{z}_i^s, \hat{z}_i^s)/\tau)}{\sum_{x\in\Lambda^-} \exp(\mathrm{sim}(\tilde{z}_i^s, \underline{g(f^s(x))})/\tau)}$$

where $\tilde{z}_i^s = g(f^s(\tilde{x}_i)),\quad \hat{z}_i^s = g(f^s(\hat{x}_i))$
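
A sketch of the global style feature $f^s(\cdot)$: channel-wise mean and variance of the encoder feature map, concatenated (whether variance or standard deviation is used is taken from the text above).

```python
import torch

def global_style_feature(feat):
    """feat: encoder output e(x) of shape (B, C, H, W).
    Returns concat(mu, sigma) of shape (B, 2C)."""
    mu = feat.mean(dim=(2, 3))             # channel-wise mean, (B, C)
    var = feat.var(dim=(2, 3))             # channel-wise variance, (B, C)
    return torch.cat([mu, var], dim=1)     # f^s(x), (B, 2C)
```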

Local matching contrastive learning module
First, the land cover categories within a single image in a semantic segmentation dataset are extremely rich; extracting only global features of the whole image to measure and distinguish images loses much information.
Second, instance-wise contrastive learning methods obtain image-level features, which may be suboptimal for semantic segmentation, which requires pixel-level discrimination.
Therefore, the local matching contrastive learning module is designed to learn representations of local regions, which is beneficial for pixel-level semantic segmentation.

Local region selection and matching
We record the pixel positions by introducing an index label to ensure that the center positions of two matched local regions correspond to each other in the original image.

Local matching feature extraction:

$$\tilde{f}_L^j = f_L(\tilde{p}_j) = \mu(\tilde{p}_j) \leftarrow \tilde{p}_j \leftarrow d(e(\tilde{x})) \xleftarrow[\text{DeepLabV3+}]{\text{Decoder}} e(\tilde{x}) \xleftarrow[\text{DeepLabV3+}]{\text{Encoder}} \tilde{x}$$

$$\hat{f}_L^j = f_L(\hat{p}_j) = \mu(\hat{p}_j) \leftarrow \hat{p}_j \leftarrow d(e(\hat{x})) \xleftarrow[\text{DeepLabV3+}]{\text{Decoder}} e(\hat{x}) \xleftarrow[\text{DeepLabV3+}]{\text{Encoder}} \hat{x}$$

$\mu(\cdot)$ represents the calculation of the mean value of each channel in the feature map.

Local matching contrastive loss:

$$L_L = \frac{1}{2N_L}\sum_{j=1}^{N_L}\left(l_L(\tilde{p}_j,\hat{p}_j)+l_L(\hat{p}_j,\tilde{p}_j)\right)$$

$$l_L(\tilde{p}_j, \hat{p}_j) = -\log\frac{\exp(\mathrm{sim}(\tilde{\mu}_j, \hat{\mu}_j)/\tau)}{\sum_{p\in\Lambda_L^-} \exp(\mathrm{sim}(\tilde{\mu}_j, \underline{g_L(f_L(p))})/\tau)}$$

$$\tilde{\mu}_j = g_L(\tilde{f}_L^j) = g_L(f_L(\tilde{p}_j)),\quad \hat{\mu}_j = g_L(\hat{f}_L^j) = g_L(f_L(\hat{p}_j))$$

$N_L$ denotes the number of all local regions selected from a mini-batch of $N$ samples, i.e. $N_L = N \times n_p$, where $n_p$ is the number of matched local regions obtained from a sample.
$\Lambda_L^-$ is a set of feature maps corresponding to all local regions except the two matched local regions.
$g_L(\cdot)$ is a projection head that is similar to $g(\cdot)$.
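
As a rough sketch, the matched local-region features could be pooled from the decoder output as below; the regions are assumed to be given as boxes in the coordinate frame of $d(e(x))$, and the index-label bookkeeping that keeps the two views aligned is omitted.

```python
import torch

def local_region_features(dec_feat, boxes):
    """dec_feat: decoder output d(e(x)) of shape (B, C, H, W).
    boxes: list of (b, top, left, h, w) regions, matched across the two views.
    Returns (len(boxes), C): per-channel mean of each local region, i.e. f_L(p_j)."""
    feats = []
    for b, top, left, h, w in boxes:
        region = dec_feat[b, :, top:top + h, left:left + w]   # p_j, (C, h, w)
        feats.append(region.mean(dim=(1, 2)))                 # mu(p_j), (C,)
    return torch.stack(feats, dim=0)
```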

Total loss:

$$L = \lambda \cdot L_G + (1-\lambda)\cdot L_L$$

$\lambda = 0.5$ in this paper.

Loss Functions

Cross Entropy Loss

  • Wikipedia
    Cross Entropy

  • PyTorch docs
    torch.nn.CrossEntropyLoss

    $$\mathscr{L}_{CE} = -\sum_{h,w}\sum_{c}\log(P_{c,h,w}) \cdot I_{G_{h,w}=c} = TP + TN$$

    Where $P_{c,h,w}\in[0,1]$ and $G_{h,w}\in\{0,\cdots,C-1\}$.

Binary and Multi-Class Cross Entropy Loss

For a binary classification/segmentation problem, for example foreground vs. background in segmentation, the binary cross entropy loss ($\mathscr{L}_{BCE}$) is defined as follows:

$$\mathscr{L}_{BCE}(g,p) = -\left(g\log(p) + (1-g)\log(1-p)\right)$$

Here, $g\in\{0,1\}$ and $p\in[0,1]$, where $g$ is the ground-truth label of the sample and $p$ is the predicted probability that the pixel (or sample) belongs to the positive class. Note that this formula applies to a single pixel.

For an output prediction mask of shape H×W and a ground-truth mask of shape H×W, the binary cross entropy loss $\mathscr{L}_{BCE}$ is computed as:

$$\mathscr{L}_{BCE}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\left(g_{h,w}\log(p_{h,w}) + (1-g_{h,w})\log(1-p_{h,w})\right)$$

Here, $g_{h,w}\in\{0,1\}$ and $p_{h,w}\in[0,1]$, where $g_{h,w}$ is the ground-truth label of pixel (h,w) and $p_{h,w}$ is the predicted probability that pixel (h,w) belongs to the positive class.
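
A minimal numerical check of this formula against `torch.nn.functional.binary_cross_entropy` (shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F

H, W = 4, 4
p = torch.rand(H, W).clamp(1e-6, 1 - 1e-6)       # predicted foreground probabilities
g = torch.randint(0, 2, (H, W)).float()          # ground-truth binary mask

loss_manual = (-(g * p.log() + (1 - g) * (1 - p).log())).mean()
loss_builtin = F.binary_cross_entropy(p, g)
assert torch.allclose(loss_manual, loss_builtin, atol=1e-6)
```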

This can be extended to multi-class problems, and the categorical cross entropy loss ($\mathscr{L}_{CCE}$) is computed as:

$$\mathscr{L}_{CCE}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\sum_{c=0}^{C-1}g_{c,h,w}\log(p_{c,h,w})$$

where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel (h,w), and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

$$g_{c,h,w} = (0_0,\cdots,0_{c-1},1_{c},0_{c+1},\cdots,0_{C-1})$$

The formula can be rewritten as:

$$\mathscr{L}_{CCE}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\log(p_{\underline{\color{red}g_{h,w}},h,w})$$

where $g$ is a matrix of shape H×W whose element $g_{h,w}\in\{0,1,\cdots,C-1\}$ represents the ground-truth label of pixel (h,w), and $p$ is a tensor of shape C×H×W whose element $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

The rewritten formula makes it easier to implement in code.
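
For example, the rewritten formula can be implemented with a `gather` over the class dimension and checked against `torch.nn.functional.cross_entropy` (which takes raw logits and applies log-softmax internally); the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

C, H, W = 6, 4, 4
logits = torch.randn(1, C, H, W)                  # raw network output
p = logits.softmax(dim=1)                         # p_{c,h,w} in [0, 1]
g = torch.randint(0, C, (1, H, W))                # g_{h,w} in {0, ..., C-1}

# pick p_{g_{h,w}, h, w} for every pixel, then average -log over all pixels
p_true = p.gather(dim=1, index=g.unsqueeze(1)).squeeze(1)    # (1, H, W)
loss_manual = (-p_true.log()).mean()

loss_builtin = F.cross_entropy(logits, g)         # same value, computed from logits
assert torch.allclose(loss_manual, loss_builtin, atol=1e-5)
```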

CrossEntropy $\Leftrightarrow$ Accuracy

Why not just use accuracy as the loss function?

$$\begin{gather*} \mathscr{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}-\log\left(\fbox{$p_{\textcolor{red}{g_{h,w}},h,w}$}\right)\\ \Downarrow\\ \mathscr{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}-\log\left(\fbox{$TP_{\textcolor{red}{g_{h,w}},h,w}$}\right) \end{gather*}$$

$$\text{Accuracy} = \frac{1}{H\times W}\sum_{c} TP_c = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\fbox{$\lceil TP_{\textcolor{red}{g_{h,w}},h,w} \rceil$}$$

where $\lceil\cdot\rceil$ is the round-up operator and $\lceil TP_{\textcolor{red}{g_{h,w}},h,w} \rceil \in \{0,1\}$.

$\lceil TP_{\textcolor{red}{g_{h,w}},h,w} \rceil$ rounds up to 1 only if $TP_{\textcolor{red}{g_{h,w}},h,w}$ is the maximum value among the class predictions at pixel (h,w). This rounding (argmax) step makes accuracy piecewise constant and non-differentiable, so it provides no useful gradient, whereas cross entropy acts as a smooth surrogate.

Focal loss

  • Original Paper
    2017-ICCV: Focal Loss for Dense Object Detection

  • Description
    The Focal loss is a variant of the binary cross entropy loss that addresses class imbalance in the standard cross entropy loss by down-weighting the contribution of easy examples, enabling the model to learn from harder examples.

    The Focal loss $\mathscr{L}_{F}$ adds a modulating factor to the cross entropy loss:

    $$\mathscr{L}_{F}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\alpha_{\textcolor{red}{g_{h,w}}}(1-p_{\textcolor{red}{g_{h,w}},h,w})^\gamma\cdot\log(p_{\textcolor{red}{g_{h,w}},h,w})$$

    where $\gamma\uparrow \Rightarrow \alpha(1-p)^\gamma\downarrow$ and $p\uparrow \Rightarrow \alpha(1-p)^\gamma\downarrow$.
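
A sketch of a multi-class focal loss following the formula above; the per-class weights `alpha` and the value of `gamma` are illustrative.

```python
import torch

def focal_loss(p, g, alpha, gamma=2.0):
    """p: (B, C, H, W) class probabilities, g: (B, H, W) integer labels,
    alpha: (C,) per-class weights."""
    p_true = p.gather(1, g.unsqueeze(1)).squeeze(1)    # p_{g_{h,w}, h, w}
    a_true = alpha[g]                                  # alpha_{g_{h,w}}
    return (-a_true * (1 - p_true) ** gamma * p_true.log()).mean()
```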

Dice Loss

The Sørensen-Dice index, known as the Dice similarity coefficient (DSC) when applied to Boolean data, is the most commonly used metric for evaluating segmentation accuracy. We can define the DSC in terms of the per-voxel classification of true positives (TP), false positives (FP) and false negatives (FN):

$$DSC_c = \frac{2TP_c}{2TP_c+FP_c+FN_c}$$

The Dice loss $\mathscr{L}_{DSC}$ can therefore be defined as:

$$\mathscr{L}_{DSC} = \sum_{c=0}^{C-1} \left(1 - DSC_c\right)$$

The Dice loss is somewhat adapted to handle class imbalance. However, the Dice loss gradient is inherently unstable, most evident with highly class imbalanced data where gradient calculations involve small denominators.

DiceLoss $\Leftrightarrow$ F1 score

$$DSC = \frac{2TP}{2TP+FP+FN} = \frac{2}{\frac{TP+FP}{TP} + \frac{TP+FN}{TP}} = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = F_1$$

$$\mathscr{L}_{DSC} = 1 - DSC = 1 - F_1$$
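
In practice a soft (differentiable) Dice loss replaces the hard TP/FP/FN counts with sums over predicted probabilities; a minimal sketch, assuming softmax probabilities and one-hot ground truth:

```python
import torch

def dice_loss(p, g_onehot, eps=1e-6):
    """p, g_onehot: (B, C, H, W); soft Dice loss summed over classes."""
    dims = (0, 2, 3)
    tp = (p * g_onehot).sum(dims)
    fp = (p * (1 - g_onehot)).sum(dims)
    fn = ((1 - p) * g_onehot).sum(dims)
    dsc = (2 * tp + eps) / (2 * tp + fp + fn + eps)    # per-class soft DSC
    return (1 - dsc).sum()
```

The small `eps` keeps the ratio defined when a class is absent from both the prediction and the ground truth.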

Tversky loss

  • Original Paper
    2017 Tversky loss function for image segmentation using 3D fully convolutional deep networks

  • Description
    The Tversky Index is closely related to the DSC, but enables optimisation for output imbalance by assigning weights $\alpha$ and $\beta$ to false positives and false negatives, respectively:

    $$\begin{split} TI_c &= \frac{TP_c}{TP_c + \alpha FP_c + \beta FN_c}\\ &= \frac{\sum_{h,w}p_{c,h,w}g_{c,h,w}}{\sum_{h,w}p_{c,h,w}g_{c,h,w} + \alpha\sum_{h,w}p_{c,h,w}\left(\sum_{i\neq c}g_{i,h,w}\right) + \beta\sum_{h,w}\left(\sum_{i\neq c}p_{i,h,w}\right)g_{c,h,w}}\\ &= \frac{\sum_{h,w}p_{c,h,w}g_{c,h,w}}{\sum_{h,w}p_{c,h,w}g_{c,h,w} + \alpha\sum_{h,w}p_{c,h,w}(1-g_{c,h,w}) + \beta\sum_{h,w}(1-p_{c,h,w})g_{c,h,w}} \end{split}$$

    where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel (h,w), and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

    $$g_{c,h,w} = (0_0,\cdots,0_{c-1},1_{c},0_{c+1},\cdots,0_{C-1})$$

    Using the Tversky Index, we define the Tversky loss $\mathscr{L}_{T}$ for a segmentation task with C categories as:

    $$\mathscr{L}_{T} = \sum_{c=0}^{C-1}(1-TI_c)$$

    When the Dice loss is applied to class-imbalanced problems, the resulting segmentation often exhibits high precision but low recall. Assigning a greater weight to false negatives improves recall and yields a better balance of precision and recall. Therefore, $\beta$ is often set higher than $\alpha$, most commonly $\beta = 0.7$ and $\alpha = 0.3$.
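
A soft Tversky loss along these lines (same assumptions as the Dice sketch above; with $\alpha=\beta=0.5$ it reduces to the soft Dice loss):

```python
import torch

def tversky_loss(p, g_onehot, alpha=0.3, beta=0.7, eps=1e-6):
    """p, g_onehot: (B, C, H, W); Tversky loss summed over classes."""
    dims = (0, 2, 3)
    tp = (p * g_onehot).sum(dims)
    fp = (p * (1 - g_onehot)).sum(dims)
    fn = ((1 - p) * g_onehot).sum(dims)
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)   # per-class Tversky index
    return (1 - ti).sum()
```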

Focal Tversky loss

  • Original Paper
    2018-10 A novel focal Tversky loss function with improved attention U-Net for lesion segmentation

  • Description
    Using the definition of the Tversky Index, the Focal Tversky loss $\mathscr{L}_{FT}$ is defined as:

    $$\mathscr{L}_{FT} = \sum_{c=0}^{C-1}(1-TI_c)^{\frac{1}{\gamma}}$$

    where $\gamma < 1$ increases the degree of focusing on harder examples. The Focal Tversky loss simplifies to the Tversky loss when $\gamma = 1$.
    The optimal value reported was $\gamma = 4/3$, which enhances rather than suppresses the loss of easy examples.

Combo loss

  • Original Paper
    2018-05 Combo loss: handling input and output imbalance in multi-organ segmentation

  • Description
    The Combo loss $\mathscr{L}_{combo}$ is defined as a weighted sum of the DSC and a modified form of the cross entropy loss $\mathscr{L}_{mCE}$:

    $$\mathscr{L}_{combo} = \alpha\,\mathscr{L}_{mCE} - (1-\alpha) \cdot DSC$$

    where:

    $$\mathscr{L}_{mCE} = -\frac{1}{H \times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} \beta_{\color{red}g_{h,w}}\cdot\log(p_{\underline{\color{red}g_{h,w}},h,w})$$

    and $\alpha \in [0,1]$ controls the relative contribution of the Dice and cross entropy terms to the loss, while $\beta_0,\cdots,\beta_{C-1}$ control the relative weights assigned to the true positives of each category. A larger $\beta_c$ gives a larger weight to the TPs of the corresponding class c and a larger penalty on its FNs.

    The larger $\beta_c$ is relative to the $\beta$ values of the other classes, the larger the weight on the TPs of class c, and correspondingly the heavier the "penalty" on the FNs of class c.
    Because $-\beta_c\cdot\log(p_{c})$ is decreasing in $p_c$, a larger $p_c$ gives a smaller loss. The larger $\beta_c$ is, the larger the weight of $-\log(p_c)$, so the loss pushes the model to output a larger $p_c$ (TP) on pixels where g = c and correspondingly smaller probabilities $p_{\cancel c}$ (FN) for the other classes; $p_{\cancel c}$ denotes the predicted probabilities of the classes other than c.
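
A sketch of the Combo loss following the two formulas above, assuming softmax probabilities; here the soft DSC term is computed globally over all classes and pixels, which is one possible reading rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def combo_loss(p, g, beta, alpha=0.5, eps=1e-6):
    """p: (B, C, H, W) probabilities, g: (B, H, W) integer labels,
    beta: (C,) per-class weights for the modified cross entropy term."""
    p_true = p.gather(1, g.unsqueeze(1)).squeeze(1)
    mce = (-beta[g] * p_true.log()).mean()                   # modified cross entropy
    g_onehot = F.one_hot(g, p.shape[1]).permute(0, 3, 1, 2).float()
    tp = (p * g_onehot).sum()
    dsc = (2 * tp + eps) / ((p + g_onehot).sum() + eps)      # soft DSC: 2TP / (2TP+FP+FN)
    return alpha * mce - (1 - alpha) * dsc
```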

DiceFocal loss

Hybrid Focal loss

Unified Focal loss

  • Original Paper
    2021-02 Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation

  • Description
    Firstly, the authors replace $\alpha$ in the Focal loss and $\alpha$ and $\beta$ in the Tversky Index with a common $\delta$ parameter to control output imbalance, and reformulate $\gamma$ to enable simultaneous Focal loss suppression and Focal Tversky loss enhancement, naming these the modified Focal loss $\mathscr{L}_{mF}$ and the modified Focal Tversky loss $\mathscr{L}_{mFT}$, respectively:

    $$\begin{split} \mathscr{L}_{mF} &= -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\delta_{\textcolor{red}{g_{h,w}}}(1-p_{\textcolor{red}{g_{h,w}},h,w})^{1-\gamma}\cdot\log(p_{\textcolor{red}{g_{h,w}},h,w}) \\ \mathscr{L}_{mFT} &= \sum_{c=0}^{C-1}(1-\text{mTI}_c)^{\gamma} \end{split}$$

    where $\text{mTI}_c$ is redefined as follows:

    $$\begin{gather*} \text{mTI}_c = \frac{TP_c}{TP_c + \delta\cdot FP_c + (1-\delta)\cdot FN_c} \\ TP_c = \sum_{h,w}p_{c,h,w}g_{c,h,w},\quad FP_c = \sum_{h,w}p_{c,h,w}(1-g_{c,h,w}),\quad FN_c = \sum_{h,w}(1-p_{c,h,w})g_{c,h,w} \end{gather*}$$

    where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel (h,w), and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

    $$g_{c,h,w} = (0_0,\cdots,0_{c-1},1_{c},0_{c+1},\cdots,0_{C-1})$$


    The symmetric variant of the Unified Focal loss $\mathscr{L}_{sUF}$ is defined as:

    $$\mathscr{L}_{sUF} = \lambda\mathscr{L}_{mF} + (1-\lambda)\mathscr{L}_{mFT}$$

    where $\lambda\in[0,1]$ determines the relative weighting of the two losses.
    By grouping functionally equivalent hyperparameters, the six hyperparameters associated with the Hybrid Focal loss are reduced to three, with

    • $\delta$ controlling the relative weighting of positive and negative examples,
    • $\gamma$ controlling both suppression of the background class and enhancement of the rare class,
    • $\lambda$ determining the weights of the two component losses.
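
A sketch of the symmetric Unified Focal loss assembled from the two modified terms above; $\delta$ is treated here as a single scalar shared by all classes, and the default values are assumptions.

```python
import torch
import torch.nn.functional as F

def unified_focal_loss(p, g, delta=0.6, gamma=0.5, lam=0.5, eps=1e-6):
    """p: (B, C, H, W) probabilities, g: (B, H, W) integer labels."""
    g_onehot = F.one_hot(g, p.shape[1]).permute(0, 3, 1, 2).float()

    # modified Focal loss: delta * (1 - p_t)^(1 - gamma) * -log(p_t), mean over pixels
    p_true = p.gather(1, g.unsqueeze(1)).squeeze(1)
    l_mf = (-delta * (1 - p_true) ** (1 - gamma) * p_true.log()).mean()

    # modified Focal Tversky loss: sum over classes of (1 - mTI_c)^gamma
    dims = (0, 2, 3)
    tp = (p * g_onehot).sum(dims)
    fp = (p * (1 - g_onehot)).sum(dims)
    fn = ((1 - p) * g_onehot).sum(dims)
    mti = (tp + eps) / (tp + delta * fp + (1 - delta) * fn + eps)
    l_mft = ((1 - mti) ** gamma).sum()

    return lam * l_mf + (1 - lam) * l_mft
```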

SimCLR

A simple framework for contrastive learning of visual representations.

$$\begin{split} \underset{\text{mini batch}}{\{x_1, x_2, \cdots, x_N\}} \xrightarrow[t'\sim\tau]{t\sim\tau}& \left\{ \begin{array}{llll} \tilde{x}_1, &\tilde{x}_2, &\cdots &\tilde{x}_N,\\ \tilde{x}_{N+1}, &\tilde{x}_{N+2}, &\cdots &\tilde{x}_{2N} \end{array} \right\} \\ \xrightarrow[\text{ResNet+MLP}]{f+g}& \left\{ \begin{array}{llll} z_1, &z_2, &\cdots &z_N,\\ z_{N+1}, &z_{N+2}, &\cdots &z_{2N} \end{array} \right\} \end{split}$$

Given a positive pair $(i, N+i)$, we treat the other $2(N-1)$ augmented examples within a minibatch as negative examples.

$$\begin{split} l_{i, N+i} &=-\log\frac{\exp({\rm sim}(z_i,z_{N+i})/\tau)} {\sum_{k=1,k\neq i}^{2N} \exp({\rm sim}(z_i, z_k)/\tau)} \\ &= -\log\frac{m(z_i, z_{N+i})} {\begin{array}{lcccl} m(z_i,z_1) & + \cdots + & 0 & + \cdots + & m(z_i,z_N)+\\ m(z_i,z_{N+1}) & + \cdots + & m(z_i,z_{N+i}) & + \cdots + & m(z_i,z_{2N}) \end{array}} \end{split}$$

$$m(z_i,z_j) = \exp({\rm sim}(z_i,z_j)/\tau),\quad j=1,\cdots,2N$$

cosine similarity:

$${\rm sim}(z_i, z_j) = \frac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|} = \cos\left(\langle z_i,z_j\rangle\right)$$

Note: $m(z_i, z_j) = m(z_j, z_i)$, but $l_{i, j} \neq l_{j,i}$.
They have the same numerator, but their denominators are quite different.

$$\begin{split} l_{N+i,i} &=-\log\frac{\exp({\rm sim}(z_{N+i},z_{i})/\tau)} {\sum_{k=1,k\neq N+i}^{2N} \exp({\rm sim}(z_{N+i}, z_k)/\tau)} \\ &= -\log\frac{m(z_{N+i}, z_{i})} {\begin{array}{lcccl} m(z_{N+i},z_1) & + \cdots + & m(z_{N+i},z_{i}) & + \cdots + & m(z_{N+i},z_N)+\\ m(z_{N+i},z_{N+1}) & + \cdots + & 0 & + \cdots + & m(z_{N+i},z_{2N}) \end{array}} \end{split}$$

The final loss is computed across all positive pairs, both $(i, j)$ and $(j, i)$, in a mini-batch.

$$L =\frac{1}{2N}\sum_{i=1}^N (l_{i, N+i} + l_{N+i, i})$$
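
As a sketch, the loss above translates almost literally into code via the $m(\cdot,\cdot)$ matrix (this follows the formulas here, not the reference implementation); `z` stacks the $2N$ projections so that rows $i$ and $N+i$ form a positive pair.

```python
import torch
import torch.nn.functional as F

def simclr_loss(z, tau=0.5):
    """z: (2N, d) projections; rows i and N+i are the two views of sample i."""
    n2 = z.shape[0]
    n = n2 // 2
    z = F.normalize(z, dim=1)
    m = torch.exp(z @ z.t() / tau)                 # m(z_i, z_j) = exp(sim(z_i, z_j)/tau)
    m = m * (1 - torch.eye(n2))                    # zero out the k = i terms
    pos = torch.cat([torch.arange(n) + n, torch.arange(n)])
    l = -torch.log(m[torch.arange(n2), pos] / m.sum(dim=1))   # l_{i,N+i} and l_{N+i,i}
    return l.mean()                                # = (1 / 2N) * sum over all pairs
```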

Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google’s own TensorFlow software. Google began using TPUs internally in 2015, and in 2018 made them available for third party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.