Self-Supervised Learning

Self-Supervised Representation Learning | Lil’Log

A self-supervised task is also known as a pretext task.
Broadly speaking, all generative models can be considered self-supervised, but with different goals: generative models focus on creating diverse and realistic images, while self-supervised representation learning cares about producing good features that are generally helpful for many tasks.

RGB-IR

Remote sensing images (RSI) typically come in three color modes:

  • IRRG: 3 channels (IR-R-G)
  • RGB: 3 channels (R-G-B)
  • RGBIR: 4 channels (R-G-B-IR)

where RGB stands for Red-Green-Blue and IR stands for InfraRed.

Datasets

ISPRS, the International Society for Photogrammetry and Remote Sensing, is a non-governmental organization devoted to advancing the development, application, and international cooperation of photogrammetry and remote sensing.
Its ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling includes two well-known datasets for remote sensing image segmentation: the Potsdam and Vaihingen datasets.

Note: the ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling page provides download links for the datasets.

Potsdam

Vaihingen

Note: the Vaihingen dataset contains 33 patches in total; the patches numbered 9, 18, 19, 25, and 36 are missing.

About GLCNet

global style and local matching contrastive learning network (GLCNet)

  • Global style contrastive learning module
  • Local matching contrastive learning module

Network architecture

Input: $x_i$

Data augmentation:

$$\tilde{x}_i = t_1(x_i),\quad \hat{x}_i = t_2(x_i)$$

$t_1(\cdot)$ represents random cropping followed by resizing to a fixed resolution (e.g. $224 \times 224$).
$t_2(\cdot)$ represents sequentially applying several augmentations: random cropping followed by resizing to a fixed resolution, random flipping, random rotation, random color distortion, random Gaussian blur, random noise, and random grayscale.
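
As a rough illustration, the two pipelines could be written with torchvision as below; the operations follow the list above, but the parameter values (crop size, jitter strength, blur kernel, noise scale) are assumptions rather than the paper's settings.

```python
import torch
import torchvision.transforms as T

# t1: random crop followed by resizing to a fixed resolution (assumed 224x224)
t1 = T.Compose([
    T.RandomResizedCrop(224),
    T.ToTensor(),
])

# t2: crop/resize, then flipping, rotation, color distortion, Gaussian blur,
# noise and grayscale; all probabilities/strengths here are illustrative
t2 = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=90),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # additive Gaussian noise
])
```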

Feature extraction:

$$\tilde{f}_i = \mu(e(\tilde{x}_i)),\quad \hat{f}_i = \mu(e(\hat{x}_i))$$

$e(\cdot)$ is the encoder of the semantic segmentation network DeepLabV3+.
$\mu(\cdot)$ represents the calculation of the mean value of each channel in the feature map (i.e. global average pooling).

Encoder

Projection head:

$$\tilde{z}_i = g(\tilde{f}_i) = \tilde{W}^{(2)}\cdot{\rm ReLU}\left(\tilde{W}^{(1)}\cdot\tilde{f}_i\right),\qquad \hat{z}_i = g(\hat{f}_i) = \hat{W}^{(2)}\cdot{\rm ReLU}\left(\hat{W}^{(1)}\cdot\hat{f}_i\right)$$

Open question: is $\hat{W} = \tilde{W}$?

The projection head $g(\cdot)$ is an MLP with one hidden layer (with ReLU).

The presence of $g(\cdot)$ in SimCLR has been shown to be very beneficial, possibly because it allows $e(\cdot)$ to form and retain more information potentially useful for downstream tasks.
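
A minimal sketch of $e(\cdot)$, $\mu(\cdot)$, and $g(\cdot)$ in PyTorch; the ResNet-50 backbone and the dimensions (2048 hidden units, 128-d output) are assumptions standing in for the actual DeepLabV3+ encoder and the paper's settings.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Stand-in for e(.): a ResNet backbone returning a spatial feature map."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc

    def forward(self, x):
        return self.backbone(x)            # (B, 2048, H/32, W/32)

class ProjectionHead(nn.Module):
    """g(.): an MLP with one hidden layer and ReLU."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, f):
        return self.net(f)

e, g = Encoder(), ProjectionHead()
x_tilde = torch.randn(4, 3, 224, 224)      # a batch of augmented views
f_tilde = e(x_tilde).mean(dim=(2, 3))      # mu(.): global average pooling -> (B, 2048)
z_tilde = g(f_tilde)                       # (B, 128) projection
```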

Contrastive loss:

$$L_c = \frac{1}{2N}\sum_{i=1}^N\left(l(\tilde{x}_i,\hat{x}_i)+l(\hat{x}_i,\tilde{x}_i)\right)$$

$$l(\tilde{x}_i, \hat{x}_i) = -\log\frac{\exp(\mathrm{sim}(\tilde{z}_i, \hat{z}_i)/\tau)}{\sum_{x\in\Lambda^-} \exp(\mathrm{sim}(\tilde{z}_i, \underline{g(f(x))})/\tau)}$$

$\mathrm{sim}(\cdot)$ denotes the similarity measure between two feature vectors; in this work it is the cosine similarity.
$\Lambda^-$ denotes the $2(N-1)$ negative samples in addition to the positive sample pair.
$\tau$ denotes a temperature parameter.
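
A compact sketch of this loss for a mini-batch of paired projections (the general NT-Xent form, not necessarily the authors' exact implementation); `z1` and `z2` are the projections of the two augmented views.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (N, d) projections of the two views; returns the averaged loss."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                # cosine similarities / tau
    n = z1.shape[0]
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self terms
    # the positive of sample i is i + N, and vice versa
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)                 # averages over all 2N anchors
```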

GLCNet:

Global style contrastive learning module
We calculate the ==channel-wise mean== and ==variance of the features extracted by the encoder $e(\cdot)$== to extract the global style feature vector.

$$f^s(x_i) = \mathrm{concat}\left(\mu(e(x_i)), \sigma(e(x_i))\right)$$

$\mu(e(x_i))$: channel-wise mean of the feature map.
$\sigma(e(x_i))$: variance of the features extracted by the encoder $e(\cdot)$.

The global style contrastive learning loss is defined as follows:

$$L_G = \frac{1}{2N}\sum_{i=1}^N\left(l_g(\tilde{x}_i,\hat{x}_i)+l_g(\hat{x}_i,\tilde{x}_i)\right)$$

$$l_g(\tilde{x}_i, \hat{x}_i) = -\log\frac{\exp(\mathrm{sim}(\tilde{z}_i^s, \hat{z}_i^s)/\tau)}{\sum_{x\in\Lambda^-} \exp(\mathrm{sim}(\tilde{z}_i^s, \underline{g(f^s(x))})/\tau)}$$

where $\tilde{z}_i^s = g(f^s(\tilde{x}_i)),\quad \hat{z}_i^s = g(f^s(\hat{x}_i))$
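
A sketch of the global style feature $f^s(\cdot)$: channel-wise mean and variance of the encoder feature map, concatenated (whether variance or standard deviation is used is taken from the text above).

```python
import torch

def global_style_feature(feat):
    """feat: encoder output e(x) of shape (B, C, H, W).
    Returns concat(mu, sigma) of shape (B, 2C)."""
    mu = feat.mean(dim=(2, 3))             # channel-wise mean, (B, C)
    var = feat.var(dim=(2, 3))             # channel-wise variance, (B, C)
    return torch.cat([mu, var], dim=1)     # f^s(x), (B, 2C)
```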

Local matching contrastive learning module
First, the land cover categories within a single image in a semantic segmentation dataset are extremely rich; extracting only global features of the whole image to measure and distinguish images loses much information.
Second, instance-wise contrastive learning methods obtain image-level features, which may be suboptimal for semantic segmentation, which requires pixel-level discrimination.
Therefore, the local matching contrastive learning module is designed to learn representations of local regions, which is beneficial for pixel-level semantic segmentation.

Local region selection and matching
We record the pixel positions by introducing an index label to ensure that the center positions of two matched local regions correspond to each other in the original image.

Local matching feature extraction:

$$\tilde{f}_L^j = f_L(\tilde{p}_j) = \mu(\tilde{p}_j) \leftarrow \tilde{p}_j \leftarrow d(e(\tilde{x})) \xleftarrow[\text{DeepLabV3+}]{\text{Decoder}} e(\tilde{x}) \xleftarrow[\text{DeepLabV3+}]{\text{Encoder}} \tilde{x}$$

$$\hat{f}_L^j = f_L(\hat{p}_j) = \mu(\hat{p}_j) \leftarrow \hat{p}_j \leftarrow d(e(\hat{x})) \xleftarrow[\text{DeepLabV3+}]{\text{Decoder}} e(\hat{x}) \xleftarrow[\text{DeepLabV3+}]{\text{Encoder}} \hat{x}$$

$\mu(\cdot)$ represents the calculation of the mean value of each channel in the feature map.

Local matching contrastive loss:

$$L_L = \frac{1}{2N_L}\sum_{j=1}^{N_L}\left(l_L(\tilde{p}_j,\hat{p}_j)+l_L(\hat{p}_j,\tilde{p}_j)\right)$$

$$l_L(\tilde{p}_j, \hat{p}_j) = -\log\frac{\exp(\mathrm{sim}(\tilde{\mu}_j, \hat{\mu}_j)/\tau)}{\sum_{p\in\Lambda_L^-} \exp(\mathrm{sim}(\tilde{\mu}_j, \underline{g_L(f_L(p))})/\tau)}$$

$$\tilde{\mu}_j = g_L(\tilde{f}_L^j) = g_L(f_L(\tilde{p}_j)),\quad \hat{\mu}_j = g_L(\hat{f}_L^j) = g_L(f_L(\hat{p}_j))$$

$N_L$ denotes the number of all local regions selected from a mini-batch of $N$ samples, i.e. $N_L = N \times n_p$, where $n_p$ is the number of matched local regions obtained from a sample.
$\Lambda_L^-$ is a set of feature maps corresponding to all local regions except the two matched local regions.
$g_L(\cdot)$ is a projection head that is similar to $g(\cdot)$.
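
As a rough sketch, the matched local-region features could be pooled from the decoder output as below; the regions are assumed to be given as boxes in the coordinate frame of $d(e(x))$, and the index-label bookkeeping that keeps the two views aligned is omitted.

```python
import torch

def local_region_features(dec_feat, boxes):
    """dec_feat: decoder output d(e(x)) of shape (B, C, H, W).
    boxes: list of (b, top, left, h, w) regions, matched across the two views.
    Returns (len(boxes), C): per-channel mean of each local region, i.e. f_L(p_j)."""
    feats = []
    for b, top, left, h, w in boxes:
        region = dec_feat[b, :, top:top + h, left:left + w]   # p_j, (C, h, w)
        feats.append(region.mean(dim=(1, 2)))                 # mu(p_j), (C,)
    return torch.stack(feats, dim=0)
```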

Total loss:

$$L = \lambda \cdot L_G + (1-\lambda)\cdot L_L$$

$\lambda = 0.5$ in this paper.

Loss Functions

Cross Entropy Loss

  • Wikipedia
    Cross Entropy

  • PyTorch docs
    torch.nn.CrossEntropyLoss

    $$\mathscr{L}_{CE} = -\sum_{h,w}\sum_{c}\log(P_{c,h,w}) \cdot I_{G_{h,w}=c} = TP + TN$$

    Where $P_{c,h,w}\in[0,1]$ and $G_{h,w}\in\{0,\cdots,C-1\}$.

Binary and Multi-Class Cross Entropy Loss

For a binary classification/segmentation problem, for example foreground vs. background in segmentation, the binary cross entropy loss ($\mathscr{L}_{BCE}$) is defined as follows:

$$\mathscr{L}_{BCE}(g,p) = -\left(g\log(p) + (1-g)\log(1-p)\right)$$

Here, $g\in\{0,1\}$ and $p\in[0,1]$, where $g$ is the ground-truth label of the sample and $p$ is the predicted probability that the pixel (or sample) belongs to the positive class. Note that this formula applies to a single pixel.

For an output prediction mask of shape H×W and a ground-truth mask of shape H×W, the binary cross entropy loss $\mathscr{L}_{BCE}$ is computed as:

$$\mathscr{L}_{BCE}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\left(g_{h,w}\log(p_{h,w}) + (1-g_{h,w})\log(1-p_{h,w})\right)$$

Here, $g_{h,w}\in\{0,1\}$ and $p_{h,w}\in[0,1]$, where $g_{h,w}$ is the ground-truth label of pixel (h,w) and $p_{h,w}$ is the predicted probability that pixel (h,w) belongs to the positive class.
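
A minimal numerical check of this formula against `torch.nn.functional.binary_cross_entropy` (shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F

H, W = 4, 4
p = torch.rand(H, W).clamp(1e-6, 1 - 1e-6)       # predicted foreground probabilities
g = torch.randint(0, 2, (H, W)).float()          # ground-truth binary mask

loss_manual = (-(g * p.log() + (1 - g) * (1 - p).log())).mean()
loss_builtin = F.binary_cross_entropy(p, g)
assert torch.allclose(loss_manual, loss_builtin, atol=1e-6)
```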

This can be extended to multi-class problems, and the categorical cross entropy loss ($\mathscr{L}_{CCE}$) is computed as:

$$\mathscr{L}_{CCE}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\sum_{c=0}^{C-1}g_{c,h,w}\log(p_{c,h,w})$$

where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel (h,w), and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

$$g_{c,h,w} = (0_0,\cdots,0_{c-1},1_{c},0_{c+1},\cdots,0_{C-1})$$

The formula can be rewritten as:

$$\mathscr{L}_{CCE}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\log(p_{\underline{\color{red}g_{h,w}},h,w})$$

where $g$ is a matrix of shape H×W whose element $g_{h,w}\in\{0,1,\cdots,C-1\}$ represents the ground-truth label of pixel (h,w), and $p$ is a tensor of shape C×H×W whose element $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

The rewritten formula makes it easier to implement in code.
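
For example, the rewritten formula can be implemented with a `gather` over the class dimension and checked against `torch.nn.functional.cross_entropy` (which takes raw logits and applies log-softmax internally); the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

C, H, W = 6, 4, 4
logits = torch.randn(1, C, H, W)                  # raw network output
p = logits.softmax(dim=1)                         # p_{c,h,w} in [0, 1]
g = torch.randint(0, C, (1, H, W))                # g_{h,w} in {0, ..., C-1}

# pick p_{g_{h,w}, h, w} for every pixel, then average -log over all pixels
p_true = p.gather(dim=1, index=g.unsqueeze(1)).squeeze(1)    # (1, H, W)
loss_manual = (-p_true.log()).mean()

loss_builtin = F.cross_entropy(logits, g)         # same value, computed from logits
assert torch.allclose(loss_manual, loss_builtin, atol=1e-5)
```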

CrossEntropy $\Leftrightarrow$ Accuracy

Why not just use accuracy as the loss function?

$$\begin{gather*} \mathscr{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}-\log\left(\fbox{$p_{\textcolor{red}{g_{h,w}},h,w}$}\right)\\ \Downarrow\\ \mathscr{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}-\log\left(\fbox{$TP_{\textcolor{red}{g_{h,w}},h,w}$}\right) \end{gather*}$$

$$\text{Accuracy} = \frac{1}{H\times W}\sum_{c} TP_c = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\fbox{$\lceil TP_{\textcolor{red}{g_{h,w}},h,w} \rceil$}$$

where $\lceil\cdot\rceil$ is the round-up operator and $\lceil TP_{\textcolor{red}{g_{h,w}},h,w} \rceil \in \{0,1\}$.

$\lceil TP_{\textcolor{red}{g_{h,w}},h,w} \rceil$ rounds up to 1 only if $TP_{\textcolor{red}{g_{h,w}},h,w}$ is the maximum value among the class predictions at pixel (h,w). This rounding (argmax) step makes accuracy piecewise constant and non-differentiable, so it provides no useful gradient, whereas cross entropy acts as a smooth surrogate.

Focal loss

  • Original Paper
    2017-ICCV: Focal Loss for Dense Object Detection

  • Description
    The Focal loss is a variant of the binary cross entropy loss that addresses class imbalance in the standard cross entropy loss by down-weighting the contribution of easy examples, enabling the model to learn from harder examples.

    The Focal loss $\mathscr{L}_{F}$ adds a modulating factor to the cross entropy loss:

    $$\mathscr{L}_{F}(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\alpha_{\textcolor{red}{g_{h,w}}}(1-p_{\textcolor{red}{g_{h,w}},h,w})^\gamma\cdot\log(p_{\textcolor{red}{g_{h,w}},h,w})$$

    where $\gamma\uparrow \Rightarrow \alpha(1-p)^\gamma\downarrow$ and $p\uparrow \Rightarrow \alpha(1-p)^\gamma\downarrow$.
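
A sketch of a multi-class focal loss following the formula above; the per-class weights `alpha` and the value of `gamma` are illustrative.

```python
import torch

def focal_loss(p, g, alpha, gamma=2.0):
    """p: (B, C, H, W) class probabilities, g: (B, H, W) integer labels,
    alpha: (C,) per-class weights."""
    p_true = p.gather(1, g.unsqueeze(1)).squeeze(1)    # p_{g_{h,w}, h, w}
    a_true = alpha[g]                                  # alpha_{g_{h,w}}
    return (-a_true * (1 - p_true) ** gamma * p_true.log()).mean()
```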

Dice Loss

The Sørensen-Dice index, known as the Dice similarity coefficient (DSC) when applied to Boolean data, is the most commonly used metric for evaluating segmentation accuracy. We can define the DSC in terms of the per-voxel classification of true positives (TP), false positives (FP) and false negatives (FN):

$$DSC_c = \frac{2TP_c}{2TP_c+FP_c+FN_c}$$

The Dice loss $\mathscr{L}_{DSC}$ can therefore be defined as:

$$\mathscr{L}_{DSC} = \sum_{c=0}^{C-1} \left(1 - DSC_c\right)$$

The Dice loss is somewhat adapted to handle class imbalance. However, the Dice loss gradient is inherently unstable, most evident with highly class imbalanced data where gradient calculations involve small denominators.

DiceLoss $\Leftrightarrow$ F1 score

$$DSC = \frac{2TP}{2TP+FP+FN} = \frac{2}{\frac{TP+FP}{TP} + \frac{TP+FN}{TP}} = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = F_1$$

$$\mathscr{L}_{DSC} = 1 - DSC = 1 - F_1$$
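
In practice a soft (differentiable) Dice loss replaces the hard TP/FP/FN counts with sums over predicted probabilities; a minimal sketch, assuming softmax probabilities and one-hot ground truth:

```python
import torch

def dice_loss(p, g_onehot, eps=1e-6):
    """p, g_onehot: (B, C, H, W); soft Dice loss summed over classes."""
    dims = (0, 2, 3)
    tp = (p * g_onehot).sum(dims)
    fp = (p * (1 - g_onehot)).sum(dims)
    fn = ((1 - p) * g_onehot).sum(dims)
    dsc = (2 * tp + eps) / (2 * tp + fp + fn + eps)    # per-class soft DSC
    return (1 - dsc).sum()
```

The small `eps` keeps the ratio defined when a class is absent from both the prediction and the ground truth.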

Tversky loss

  • Original Paper
    2017 Tversky loss function for image segmentation using 3D fully convolutional deep networks

  • Description
    The Tversky Index is closely related to the DSC, but enables optimisation for output imbalance by assigning weights $\alpha$ and $\beta$ to false positives and false negatives, respectively:

    $$\begin{split} TI_c &= \frac{TP_c}{TP_c + \alpha FP_c + \beta FN_c}\\ &= \frac{\sum_{h,w}p_{c,h,w}g_{c,h,w}}{\sum_{h,w}p_{c,h,w}g_{c,h,w} + \alpha\sum_{h,w}p_{c,h,w}\left(\sum_{i\neq c}g_{i,h,w}\right) + \beta\sum_{h,w}\left(\sum_{i\neq c}p_{i,h,w}\right)g_{c,h,w}}\\ &= \frac{\sum_{h,w}p_{c,h,w}g_{c,h,w}}{\sum_{h,w}p_{c,h,w}g_{c,h,w} + \alpha\sum_{h,w}p_{c,h,w}(1-g_{c,h,w}) + \beta\sum_{h,w}(1-p_{c,h,w})g_{c,h,w}} \end{split}$$

    where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel (h,w), and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

    $$g_{c,h,w} = (0_0,\cdots,0_{c-1},1_{c},0_{c+1},\cdots,0_{C-1})$$

    Using the Tversky Index, we define the Tversky loss $\mathscr{L}_{T}$ for a segmentation task with C categories as:

    $$\mathscr{L}_{T} = \sum_{c=0}^{C-1}(1-TI_c)$$

    When the Dice loss is applied to class-imbalanced problems, the resulting segmentation often exhibits high precision but low recall. Assigning a greater weight to false negatives improves recall and yields a better balance of precision and recall. Therefore, $\beta$ is often set higher than $\alpha$, most commonly $\beta = 0.7$ and $\alpha = 0.3$.
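
A soft Tversky loss along these lines (same assumptions as the Dice sketch above; with $\alpha=\beta=0.5$ it reduces to the soft Dice loss):

```python
import torch

def tversky_loss(p, g_onehot, alpha=0.3, beta=0.7, eps=1e-6):
    """p, g_onehot: (B, C, H, W); Tversky loss summed over classes."""
    dims = (0, 2, 3)
    tp = (p * g_onehot).sum(dims)
    fp = (p * (1 - g_onehot)).sum(dims)
    fn = ((1 - p) * g_onehot).sum(dims)
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)   # per-class Tversky index
    return (1 - ti).sum()
```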

Focal Tversky loss

  • Original Paper
    2018-10 A novel focal Tversky loss function with improved attention U-Net for lesion segmentation

  • Description
    Using the definition of the Tversky Index, the Focal Tversky loss $\mathscr{L}_{FT}$ is defined as:

    $$\mathscr{L}_{FT} = \sum_{c=0}^{C-1}(1-TI_c)^{\frac{1}{\gamma}}$$

    where $\gamma < 1$ increases the degree of focusing on harder examples. The Focal Tversky loss simplifies to the Tversky loss when $\gamma = 1$.
    The optimal value reported was $\gamma = 4/3$, which enhances rather than suppresses the loss of easy examples.

Combo loss

  • Original Paper
    2018-05 Combo loss: handling input and output imbalance in multi-organ segmentation

  • Description
    The Combo loss $\mathscr{L}_{combo}$ is defined as a weighted sum of the DSC and a modified form of the cross entropy loss $\mathscr{L}_{mCE}$:

    $$\mathscr{L}_{combo} = \alpha\,\mathscr{L}_{mCE} - (1-\alpha) \cdot DSC$$

    where:

    $$\mathscr{L}_{mCE} = -\frac{1}{H \times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} \beta_{\color{red}g_{h,w}}\cdot\log(p_{\underline{\color{red}g_{h,w}},h,w})$$

    and $\alpha \in [0,1]$ controls the relative contribution of the Dice and cross entropy terms to the loss, while $\beta_0,\cdots,\beta_{C-1}$ control the relative weights assigned to the true positives of each category. A larger $\beta_c$ gives a larger weight to the TPs of the corresponding class c and a larger penalty on its FNs.

    The larger $\beta_c$ is relative to the $\beta$ values of the other classes, the larger the weight on the TPs of class c, and correspondingly the heavier the "penalty" on the FNs of class c.
    Because $-\beta_c\cdot\log(p_{c})$ is decreasing in $p_c$, a larger $p_c$ gives a smaller loss. The larger $\beta_c$ is, the larger the weight of $-\log(p_c)$, so the loss pushes the model to output a larger $p_c$ (TP) on pixels where g = c and correspondingly smaller probabilities $p_{\cancel c}$ (FN) for the other classes; $p_{\cancel c}$ denotes the predicted probabilities of the classes other than c.
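
A sketch of the Combo loss following the two formulas above, assuming softmax probabilities; here the soft DSC term is computed globally over all classes and pixels, which is one possible reading rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def combo_loss(p, g, beta, alpha=0.5, eps=1e-6):
    """p: (B, C, H, W) probabilities, g: (B, H, W) integer labels,
    beta: (C,) per-class weights for the modified cross entropy term."""
    p_true = p.gather(1, g.unsqueeze(1)).squeeze(1)
    mce = (-beta[g] * p_true.log()).mean()                   # modified cross entropy
    g_onehot = F.one_hot(g, p.shape[1]).permute(0, 3, 1, 2).float()
    tp = (p * g_onehot).sum()
    dsc = (2 * tp + eps) / ((p + g_onehot).sum() + eps)      # soft DSC: 2TP / (2TP+FP+FN)
    return alpha * mce - (1 - alpha) * dsc
```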

DiceFocal loss

Hybrid Focal loss

Unified Focal loss

  • Original Paper
    2021-02 Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation

  • Description
    Firstly, the authors replace $\alpha$ in the Focal loss and $\alpha$ and $\beta$ in the Tversky Index with a common $\delta$ parameter to control output imbalance, and reformulate $\gamma$ to enable simultaneous Focal loss suppression and Focal Tversky loss enhancement, naming these the modified Focal loss $\mathscr{L}_{mF}$ and the modified Focal Tversky loss $\mathscr{L}_{mFT}$, respectively:

    $$\begin{split} \mathscr{L}_{mF} &= -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\delta_{\textcolor{red}{g_{h,w}}}(1-p_{\textcolor{red}{g_{h,w}},h,w})^{1-\gamma}\cdot\log(p_{\textcolor{red}{g_{h,w}},h,w}) \\ \mathscr{L}_{mFT} &= \sum_{c=0}^{C-1}(1-\text{mTI}_c)^{\gamma} \end{split}$$

    where $\text{mTI}_c$ is redefined as follows:

    $$\begin{gather*} \text{mTI}_c = \frac{TP_c}{TP_c + \delta\cdot FP_c + (1-\delta)\cdot FN_c} \\ TP_c = \sum_{h,w}p_{c,h,w}g_{c,h,w},\quad FP_c = \sum_{h,w}p_{c,h,w}(1-g_{c,h,w}),\quad FN_c = \sum_{h,w}(1-p_{c,h,w})g_{c,h,w} \end{gather*}$$

    where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel (h,w), and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel (h,w) belonging to label c.

    $$g_{c,h,w} = (0_0,\cdots,0_{c-1},1_{c},0_{c+1},\cdots,0_{C-1})$$


    The symmetric variant of the Unified Focal loss $\mathscr{L}_{sUF}$ is defined as:

    $$\mathscr{L}_{sUF} = \lambda\mathscr{L}_{mF} + (1-\lambda)\mathscr{L}_{mFT}$$

    where $\lambda\in[0,1]$ determines the relative weighting of the two losses.
    By grouping functionally equivalent hyperparameters, the six hyperparameters associated with the Hybrid Focal loss are reduced to three, with

    • $\delta$ controlling the relative weighting of positive and negative examples,
    • $\gamma$ controlling both suppression of the background class and enhancement of the rare class,
    • $\lambda$ determining the weights of the two component losses.
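
A sketch of the symmetric Unified Focal loss assembled from the two modified terms above; $\delta$ is treated here as a single scalar shared by all classes, and the default values are assumptions.

```python
import torch
import torch.nn.functional as F

def unified_focal_loss(p, g, delta=0.6, gamma=0.5, lam=0.5, eps=1e-6):
    """p: (B, C, H, W) probabilities, g: (B, H, W) integer labels."""
    g_onehot = F.one_hot(g, p.shape[1]).permute(0, 3, 1, 2).float()

    # modified Focal loss: delta * (1 - p_t)^(1 - gamma) * -log(p_t), mean over pixels
    p_true = p.gather(1, g.unsqueeze(1)).squeeze(1)
    l_mf = (-delta * (1 - p_true) ** (1 - gamma) * p_true.log()).mean()

    # modified Focal Tversky loss: sum over classes of (1 - mTI_c)^gamma
    dims = (0, 2, 3)
    tp = (p * g_onehot).sum(dims)
    fp = (p * (1 - g_onehot)).sum(dims)
    fn = ((1 - p) * g_onehot).sum(dims)
    mti = (tp + eps) / (tp + delta * fp + (1 - delta) * fn + eps)
    l_mft = ((1 - mti) ** gamma).sum()

    return lam * l_mf + (1 - lam) * l_mft
```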

SimCLR

A simple framework for contrastive learning of visual representations.

$$\begin{split} \underset{\text{mini batch}}{\{x_1, x_2, \cdots, x_N\}} \xrightarrow[t'\sim\tau]{t\sim\tau}& \left\{ \begin{array}{llll} \tilde{x}_1, &\tilde{x}_2, &\cdots &\tilde{x}_N,\\ \tilde{x}_{N+1}, &\tilde{x}_{N+2}, &\cdots &\tilde{x}_{2N} \end{array} \right\} \\ \xrightarrow[\text{ResNet+MLP}]{f+g}& \left\{ \begin{array}{llll} z_1, &z_2, &\cdots &z_N,\\ z_{N+1}, &z_{N+2}, &\cdots &z_{2N} \end{array} \right\} \end{split}$$

Given a positive pair $(i, N+i)$, we treat the other $2(N-1)$ augmented examples within a minibatch as negative examples.

$$\begin{split} l_{i, N+i} &=-\log\frac{\exp({\rm sim}(z_i,z_{N+i})/\tau)} {\sum_{k=1,k\neq i}^{2N} \exp({\rm sim}(z_i, z_k)/\tau)} \\ &= -\log\frac{m(z_i, z_{N+i})} {\begin{array}{lcccl} m(z_i,z_1) & + \cdots + & 0 & + \cdots + & m(z_i,z_N)+\\ m(z_i,z_{N+1}) & + \cdots + & m(z_i,z_{N+i}) & + \cdots + & m(z_i,z_{2N}) \end{array}} \end{split}$$

$$m(z_i,z_j) = \exp({\rm sim}(z_i,z_j)/\tau),\quad j=1,\cdots,2N$$

cosine similarity:

$${\rm sim}(z_i, z_j) = \frac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|} = \cos\left(\langle z_i,z_j\rangle\right)$$

Note: $m(z_i, z_j) = m(z_j, z_i)$, but $l_{i, j} \neq l_{j,i}$.
They have the same numerator, but their denominators are quite different.

$$\begin{split} l_{N+i,i} &=-\log\frac{\exp({\rm sim}(z_{N+i},z_{i})/\tau)} {\sum_{k=1,k\neq N+i}^{2N} \exp({\rm sim}(z_{N+i}, z_k)/\tau)} \\ &= -\log\frac{m(z_{N+i}, z_{i})} {\begin{array}{lcccl} m(z_{N+i},z_1) & + \cdots + & m(z_{N+i},z_{i}) & + \cdots + & m(z_{N+i},z_N)+\\ m(z_{N+i},z_{N+1}) & + \cdots + & 0 & + \cdots + & m(z_{N+i},z_{2N}) \end{array}} \end{split}$$

The final loss is computed across all positive pairs, both $(i, j)$ and $(j, i)$, in a mini-batch.

$$L =\frac{1}{2N}\sum_{i=1}^N (l_{i, N+i} + l_{N+i, i})$$
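
As a sketch, the loss above translates almost literally into code via the $m(\cdot,\cdot)$ matrix (this follows the formulas here, not the reference implementation); `z` stacks the $2N$ projections so that rows $i$ and $N+i$ form a positive pair.

```python
import torch
import torch.nn.functional as F

def simclr_loss(z, tau=0.5):
    """z: (2N, d) projections; rows i and N+i are the two views of sample i."""
    n2 = z.shape[0]
    n = n2 // 2
    z = F.normalize(z, dim=1)
    m = torch.exp(z @ z.t() / tau)                 # m(z_i, z_j) = exp(sim(z_i, z_j)/tau)
    m = m * (1 - torch.eye(n2))                    # zero out the k = i terms
    pos = torch.cat([torch.arange(n) + n, torch.arange(n)])
    l = -torch.log(m[torch.arange(n2), pos] / m.sum(dim=1))   # l_{i,N+i} and l_{N+i,i}
    return l.mean()                                # = (1 / 2N) * sum over all pairs
```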

Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google’s own TensorFlow software. Google began using TPUs internally in 2015, and in 2018 made them available for third party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.