Semantic Segmentation
Self-Supervised learning
Self-Supervised Representation Learning | Lil’Log
A self-supervised task is also known as a pretext task.
Broadly speaking, all generative models can be considered self-supervised, but with different goals: generative models focus on creating diverse and realistic images, while self-supervised representation learning cares about producing good features that are generally helpful for many tasks.
RGB-IR
A remote sensing image (RSI) typically comes in one of three color modes:
- IRRG: 3 channels (IR-R-G)
- RGB: 3 channels (R-G-B)
- RGBIR: 4 channels (R-G-B-IR)
Here RGB stands for Red, Green, Blue, and IR stands for InfraRed.
Datasets
ISPRS, the International Society for Photogrammetry and Remote Sensing, is a non-governmental organization dedicated to promoting the international development, application, and cooperation of photogrammetry and remote sensing.
Its ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling includes two well-known datasets for remote sensing image segmentation: the Potsdam and Vaihingen datasets.
Note: the ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling page provides download links for the datasets.
- Vaihingen: 2D Semantic Labeling - Vaihingen data
Note: the Vaihingen dataset contains 33 patches in total; the patches numbered 9, 18, 19, 25, and 36 are missing from the numbering.
About GLCNet
global style and local matching contrastive learning network (GLCNet)
- Global style contrastive learning module
- Local matching contrastive learning module
Network structure
Input: $x_i$
Data augmentation:
$$\tilde{x}_i = t_1(x_i), \qquad \hat{x}_i = t_2(x_i)$$
$t_1(\cdot)$ represents random cropping followed by resizing to a fixed resolution (e.g. 224×224).
$t_2(\cdot)$ represents sequentially applying several augmentations: random cropping followed by resizing to a fixed resolution, random flipping, random rotation, random color distortion, random Gaussian blur, random noise, and random grayscale.
Feature extraction:
$$\tilde{f}_i = \mu(e(\tilde{x}_i)), \qquad \hat{f}_i = \mu(e(\hat{x}_i))$$
$e(\cdot)$ is the encoder of the semantic segmentation network DeepLabV3+.
$\mu(\cdot)$ represents the calculation of the mean value of each channel in the feature map (i.e. global average pooling).
Encoder
Projection head:
$$\tilde{z}_i = g(\tilde{f}_i) = \tilde{W}^{(2)}\cdot\mathrm{ReLU}\!\left(\tilde{W}^{(1)}\cdot\tilde{f}_i\right), \qquad \hat{z}_i = g(\hat{f}_i) = \hat{W}^{(2)}\cdot\mathrm{ReLU}\!\left(\hat{W}^{(1)}\cdot\hat{f}_i\right)$$
Open question: is $\hat{W} = \tilde{W}$, i.e. do the two branches share the projection-head weights?
The projection head $g(\cdot)$ is an MLP with one hidden layer (with ReLU).
The presence of $g(\cdot)$ in SimCLR has been shown to be very beneficial, possibly because it allows the encoder $e(\cdot)$ to form and retain more potentially useful information for downstream tasks.
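A minimal sketch of this step, assuming the encoder output is a `(B, C, H, W)` feature map; the class name and layer sizes are illustrative and not taken from the paper:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Global average pooling mu(.) followed by the two-layer MLP g(.)."""

    def __init__(self, in_dim: int = 2048, hidden_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # W^(1)
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),  # W^(2)
        )

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        f = feat_map.mean(dim=(2, 3))  # mu(e(x)): channel-wise mean over H, W -> (B, C)
        return self.mlp(f)             # z = g(f)
```

In SimCLR both augmented views pass through the same encoder and the same projection head, so the weights are shared; whether GLCNet shares $\tilde{W}$ and $\hat{W}$ is exactly the open question noted above.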
Contrastive loss:
$$\mathcal{L}_C = \frac{1}{2N}\sum_{i=1}^{N}\left(l(\tilde{x}_i,\hat{x}_i) + l(\hat{x}_i,\tilde{x}_i)\right)$$
$$l(\tilde{x}_i,\hat{x}_i) = -\log\frac{\exp\!\left(\mathrm{sim}(\tilde{z}_i,\hat{z}_i)/\tau\right)}{\sum_{x\in\Lambda^-}\exp\!\left(\mathrm{sim}(\tilde{z}_i,g(f(x)))/\tau\right)}$$
$\mathrm{sim}(\cdot,\cdot)$ denotes the similarity measure between two feature vectors; in this work it is the cosine similarity.
$\Lambda^-$ denotes the $2(N-1)$ negative samples besides the positive sample pair.
$\tau$ denotes a temperature parameter.
GLCNet:
Global style contrastive learning module
We calculate the ==channel-wise mean== and ==variance of the features extracted by the encoder $e(\cdot)$== to extract the global style feature vector.
$$f_s(x_i) = \mathrm{concat}\left(\mu(e(x_i)),\ \sigma(e(x_i))\right)$$
$\mu(e(x_i))$: channel-wise mean of the feature map.
$\sigma(e(x_i))$: channel-wise variance of the features extracted by the encoder $e(\cdot)$.
The global style contrastive learning loss is defined as follows:
$$\mathcal{L}_G = \frac{1}{2N}\sum_{i=1}^{N}\left(l_g(\tilde{x}_i,\hat{x}_i) + l_g(\hat{x}_i,\tilde{x}_i)\right)$$
$$l_g(\tilde{x}_i,\hat{x}_i) = -\log\frac{\exp\!\left(\mathrm{sim}(\tilde{z}_i^s,\hat{z}_i^s)/\tau\right)}{\sum_{x\in\Lambda^-}\exp\!\left(\mathrm{sim}(\tilde{z}_i^s,g(f_s(x)))/\tau\right)}$$
where $\tilde{z}_i^s = g(f_s(\tilde{x}_i))$ and $\hat{z}_i^s = g(f_s(\hat{x}_i))$.
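A minimal sketch of the global style feature extraction, assuming a `(B, C, H, W)` encoder feature map; the helper name is mine:

```python
import torch

def global_style_feature(feat_map: torch.Tensor) -> torch.Tensor:
    """f_s(x) = concat(mu(e(x)), sigma(e(x))): channel-wise mean and variance of e(x)."""
    mu = feat_map.mean(dim=(2, 3))                    # channel-wise mean, (B, C)
    sigma = feat_map.var(dim=(2, 3), unbiased=False)  # channel-wise variance, (B, C)
    return torch.cat([mu, sigma], dim=1)              # (B, 2C) global style vector
```

The resulting vector is then passed through the projection head $g(\cdot)$ and plugged into the same contrastive loss form as above.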
Local matching contrastive learning module
First, the land cover categories within a single image of a semantic segmentation dataset are extremely rich, so extracting only global features of the whole image to measure and distinguish images loses much information.
Second, instance-wise contrastive learning methods obtain image-level features, which may be suboptimal for semantic segmentation, since it requires pixel-level discrimination.
Therefore, the local matching contrastive learning module is designed to learn the representation of local regions, which is beneficial for pixel-level semantic segmentation.
Local region selection and matching
We record the pixel positions by introducing an index label to ensure that the center positions of the two matched local regions correspond to each other in the original image.
Local matching feature extraction:
$$\tilde{x} \xrightarrow{\ \text{DeepLabV3+ encoder}\ } e(\tilde{x}) \xrightarrow{\ \text{DeepLabV3+ decoder}\ } d(e(\tilde{x})) \longrightarrow \tilde{p}_j \longrightarrow \tilde{f}_L^{\,j} = f_L(\tilde{p}_j) = \mu(\tilde{p}_j)$$
$$\hat{x} \xrightarrow{\ \text{DeepLabV3+ encoder}\ } e(\hat{x}) \xrightarrow{\ \text{DeepLabV3+ decoder}\ } d(e(\hat{x})) \longrightarrow \hat{p}_j \longrightarrow \hat{f}_L^{\,j} = f_L(\hat{p}_j) = \mu(\hat{p}_j)$$
$\mu(\cdot)$ represents the calculation of the mean value of each channel in the feature map.
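A rough sketch of this step, assuming the matched crop coordinates have already been recovered from the recorded index labels; the `(top, left, height, width)` box representation and the function name are mine, not the paper's:

```python
import torch

def local_region_features(dec_map: torch.Tensor, boxes: list) -> torch.Tensor:
    """Average-pool matched local regions p_j out of a (C, H, W) decoder feature map d(e(x)).

    boxes: list of (top, left, height, width) crops recovered from the index labels.
    Returns an (n_p, C) tensor of local features f_L(p_j) = mu(p_j).
    """
    feats = []
    for top, left, height, width in boxes:
        region = dec_map[:, top:top + height, left:left + width]  # local region p_j
        feats.append(region.mean(dim=(1, 2)))                     # mu(p_j): channel-wise mean
    return torch.stack(feats)
```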
Local matching contrastive loss:
$$\mathcal{L}_L = \frac{1}{2N_L}\sum_{j=1}^{N_L}\left(l_L(\tilde{p}_j,\hat{p}_j) + l_L(\hat{p}_j,\tilde{p}_j)\right)$$
$$l_L(\tilde{p}_j,\hat{p}_j) = -\log\frac{\exp\!\left(\mathrm{sim}(\tilde{\mu}_j,\hat{\mu}_j)/\tau\right)}{\sum_{p\in\Lambda_L^-}\exp\!\left(\mathrm{sim}(\tilde{\mu}_j,g_L(f_L(p)))/\tau\right)}$$
$$\tilde{\mu}_j = g_L(\tilde{f}_L^{\,j}) = g_L(f_L(\tilde{p}_j)), \qquad \hat{\mu}_j = g_L(\hat{f}_L^{\,j}) = g_L(f_L(\hat{p}_j))$$
$N_L$ denotes the number of all local regions selected from a mini-batch of $N$ samples, i.e. $N_L = N\times n_p$, where $n_p$ is the number of matched local regions obtained from a sample.
$\Lambda_L^-$ is the set of feature maps corresponding to all local regions except the two matched local regions.
$g_L(\cdot)$ is a projection head similar to $g(\cdot)$.
Total loss:
$$\mathcal{L} = \lambda\cdot\mathcal{L}_G + (1-\lambda)\,\mathcal{L}_L$$
$\lambda = 0.5$ in this paper.
Loss Functions
Cross Entropy Loss
- Wiki: Cross Entropy
- PyTorch docs: torch.nn.CrossEntropyLoss
$$\mathcal{L}_{CE} = -\sum_{h,w}\sum_{c}\log\left(P_{c,h,w}\right)\cdot\mathbb{I}\left[G_{h,w}=c\right]$$
where $P_{c,h,w}\in[0,1]$, $G_{h,w}\in\{0,\cdots,C-1\}$, and the indicator $\mathbb{I}[G_{h,w}=c]$ keeps only the terms evaluated on the true class of each pixel (the TP and TN predictions).
Binary and Multi-Class Cross Entropy Loss
For a binary classification/segmentation problem, e.g. foreground vs. background in segmentation, the binary cross entropy loss $\mathcal{L}_{BCE}$ is defined as follows:
$$\mathcal{L}_{BCE}(g,p) = -\left(g\log(p) + (1-g)\log(1-p)\right)$$
Here $g\in\{0,1\}$ and $p\in[0,1]$, where $g$ is the ground-truth label of the sample and $p$ is the predicted probability that the pixel (or sample) belongs to the positive class (label 1). Note that this formula is for a single pixel (or sample).
For an output prediction mask of shape $H\times W$ and a ground-truth mask of shape $H\times W$, the binary cross entropy loss $\mathcal{L}_{BCE}$ is computed as:
$$\mathcal{L}_{BCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} -\left(g_{h,w}\log(p_{h,w}) + (1-g_{h,w})\log(1-p_{h,w})\right)$$
Here $g_{h,w}\in\{0,1\}$ and $p_{h,w}\in[0,1]$, where $g_{h,w}$ is the ground-truth label of pixel $(h,w)$ and $p_{h,w}$ is the predicted probability that pixel $(h,w)$ belongs to the positive class.
This can be extended to multi-class problems, and the categorical cross entropy loss $\mathcal{L}_{CCE}$ is computed as:
$$\mathcal{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\sum_{c=0}^{C-1} -g_{c,h,w}\log\left(p_{c,h,w}\right)$$
where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel $(h,w)$, and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel $(h,w)$ belonging to label $c$:
$$g_{\,\cdot,h,w} = (\underbrace{0}_{0},\cdots,\underbrace{0}_{c-1},\underbrace{1}_{c},\underbrace{0}_{c+1},\cdots,\underbrace{0}_{C-1})$$
The formula can be rewritten as:
$$\mathcal{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} -\log\left(p_{g_{h,w},h,w}\right)$$
where $g$ is a matrix of shape $H\times W$ whose element $g_{h,w}\in\{0,1,\cdots,C-1\}$ is the ground-truth label of pixel $(h,w)$, and $p$ is a tensor of shape $C\times H\times W$ whose element $p_{c,h,w}\in[0,1]$ is the predicted value of pixel $(h,w)$ belonging to label $c$.
The rewritten formula makes it easier to write code.
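For instance, a minimal sketch of the rewritten formula, assuming `p` is an already-softmaxed `(C, H, W)` probability map and `g` an integer `(H, W)` label map (note that `torch.nn.CrossEntropyLoss` instead takes raw logits and applies log-softmax internally):

```python
import torch

def categorical_cross_entropy(p: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """L_CCE = (1 / (H*W)) * sum_{h,w} -log p[g[h,w], h, w]."""
    # pick the probability assigned to the ground-truth class of every pixel
    p_true = p.gather(0, g.unsqueeze(0)).squeeze(0)   # (H, W)
    return -p_true.clamp_min(1e-8).log().mean()       # mean over the H*W pixels
```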
CrossEntropy ⇔ Accuracy
Why not just use Accuracy as Loss function?
$$\mathcal{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} -\log\left(p_{g_{h,w},h,w}\right)$$
$$\Downarrow$$
$$\mathcal{L}_{CCE}(g,p) = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} -\log\left(TP_{g_{h,w},h,w}\right)$$
$$\mathrm{Accuracy} = \frac{1}{H\times W}\sum_{c}TP_c = \frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\left\lceil TP_{g_{h,w},h,w}\right\rceil$$
where $\lceil\cdot\rceil$ is the round-up operator and $\lceil TP_{g_{h,w},h,w}\rceil\in\{0,1\}$.
$\lceil TP_{g_{h,w},h,w}\rceil$ rounds up to 1 if $TP_{g_{h,w},h,w}$ is the maximum value among the predictions $TP_{\,\cdot,h,w}$ at pixel $(h,w)$.
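A tiny numeric illustration of the practical difference (values are made up): the $-\log p$ term is differentiable and rewards raising the true-class probability, while the rounded accuracy term is piecewise constant and carries no gradient to train on.

```python
import torch

# predicted probabilities for one pixel whose true class is 0 (made-up values)
p = torch.tensor([0.6, 0.3, 0.1], requires_grad=True)

ce = -p[0].log()     # cross entropy term for this pixel
ce.backward()
print(p.grad)        # tensor([-1.6667,  0.0000,  0.0000]) -> pushes p[0] upward

acc = (p.argmax() == 0).float()  # "accuracy" term: a hard 0/1 value
# argmax/rounding is piecewise constant, so acc has no gradient path back to p
# and cannot guide the optimizer, which is why cross entropy is used instead.
```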
Focal loss
- Original Paper: 2017-ICCV: Focal Loss for Dense Object Detection
- Description:
The Focal loss is a variant of the binary cross entropy loss that addresses class imbalance by down-weighting the contribution of easy examples, enabling learning of harder examples. The Focal loss $\mathcal{L}_F$ adds a modulating factor to the cross entropy loss:
$$\mathcal{L}_F(g,p) = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\alpha_{g_{h,w}}\left(1-p_{g_{h,w},h,w}\right)^{\gamma}\cdot\log\left(p_{g_{h,w},h,w}\right)$$
where $\gamma\uparrow\ \Rightarrow\ \alpha(1-p)^{\gamma}\downarrow$ and $p\uparrow\ \Rightarrow\ \alpha(1-p)^{\gamma}\downarrow$, i.e. a larger $\gamma$ or a more confident prediction shrinks the modulating factor.
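A minimal sketch under the same `(C, H, W)` probability / `(H, W)` label convention as above; `alpha` is a per-class weight vector and `gamma=2` is the commonly used default from the original paper:

```python
import torch

def focal_loss(p: torch.Tensor, g: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """L_F = -(1/(H*W)) * sum_{h,w} alpha[g] * (1 - p_true)^gamma * log(p_true)."""
    p_true = p.gather(0, g.unsqueeze(0)).squeeze(0).clamp_min(1e-8)  # (H, W)
    modulator = alpha[g] * (1 - p_true) ** gamma  # down-weights easy (high p_true) pixels
    return -(modulator * p_true.log()).mean()
```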
Dice Loss
The Sørensen-Dice index, known as the Dice similarity coefficient (DSC) when applied to Boolean data, is the most commonly used metric for evaluating segmentation accuracy. We can define DSC in terms of the per voxel classification of true positives (TP), false positives (FP) and false negatives (FN):
$$DSC_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$$
The Dice loss $\mathcal{L}_{DSC}$ can therefore be defined as:
$$\mathcal{L}_{DSC} = \sum_{c=0}^{C-1}\left(1 - DSC_c\right)$$
The Dice loss is somewhat adapted to handle class imbalance. However, the Dice loss gradient is inherently unstable, most evident with highly class imbalanced data where gradient calculations involve small denominators.
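A minimal soft Dice loss sketch, assuming `(C, H, W)` probabilities and one-hot labels and using the soft TP/FP/FN definitions given later in the Tversky section; a small `eps` is added to guard against the small-denominator instability mentioned above:

```python
import torch

def dice_loss(p: torch.Tensor, g_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """L_DSC = sum_c (1 - 2*TP_c / (2*TP_c + FP_c + FN_c)) with soft TP/FP/FN counts."""
    tp = (p * g_onehot).sum(dim=(1, 2))         # soft true positives per class
    fp = (p * (1 - g_onehot)).sum(dim=(1, 2))   # soft false positives per class
    fn = ((1 - p) * g_onehot).sum(dim=(1, 2))   # soft false negatives per class
    dsc = (2 * tp) / (2 * tp + fp + fn + eps)
    return (1 - dsc).sum()
```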
DiceLoss ⇔ F1 score
$$DSC = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2}{\frac{TP+FP}{TP} + \frac{TP+FN}{TP}} = \frac{2}{\frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}}} = F_1$$
$$\mathcal{L}_{DSC} = 1 - DSC = 1 - F_1$$
Tversky loss
- Original Paper: 2017 Tversky loss function for image segmentation using 3D fully convolutional deep networks
- Description:
The Tversky Index is closely related to the DSC, but enables optimisation for output imbalance by assigning weights $\alpha$ and $\beta$ to false positives and false negatives, respectively:
$$\begin{aligned} TI_c &= \frac{TP_c}{TP_c + \alpha\,FP_c + \beta\,FN_c} \\ &= \frac{\sum_{h,w} p_{c,h,w}\,g_{c,h,w}}{\sum_{h,w} p_{c,h,w}\,g_{c,h,w} + \alpha\sum_{h,w} p_{c,h,w}\left(\sum_{i\neq c} g_{i,h,w}\right) + \beta\sum_{h,w}\left(\sum_{i\neq c} p_{i,h,w}\right) g_{c,h,w}} \\ &= \frac{\sum_{h,w} p_{c,h,w}\,g_{c,h,w}}{\sum_{h,w} p_{c,h,w}\,g_{c,h,w} + \alpha\sum_{h,w} p_{c,h,w}\left(1 - g_{c,h,w}\right) + \beta\sum_{h,w}\left(1 - p_{c,h,w}\right) g_{c,h,w}} \end{aligned}$$
where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel $(h,w)$, and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel $(h,w)$ belonging to label $c$:
$$g_{\,\cdot,h,w} = (\underbrace{0}_{0},\cdots,\underbrace{0}_{c-1},\underbrace{1}_{c},\underbrace{0}_{c+1},\cdots,\underbrace{0}_{C-1})$$
Using the Tversky Index, we define the Tversky loss $\mathcal{L}_T$ for a segmentation task with $C$ categories as:
$$\mathcal{L}_T = \sum_{c=0}^{C-1}\left(1 - TI_c\right)$$
When the Dice loss is applied to class-imbalanced problems, the resulting segmentation often exhibits high precision but low recall scores. Assigning a greater weight to false negatives improves recall and results in a better balance of precision and recall. Therefore, $\beta$ is often set higher than $\alpha$, most commonly $\beta = 0.7$ and $\alpha = 0.3$.
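A sketch of the Tversky loss with the commonly used $\alpha = 0.3$, $\beta = 0.7$ defaults mentioned above, reusing the soft TP/FP/FN counts from the Dice sketch:

```python
import torch

def tversky_loss(p: torch.Tensor, g_onehot: torch.Tensor,
                 alpha: float = 0.3, beta: float = 0.7, eps: float = 1e-6) -> torch.Tensor:
    """L_T = sum_c (1 - TI_c), TI_c = TP_c / (TP_c + alpha*FP_c + beta*FN_c)."""
    tp = (p * g_onehot).sum(dim=(1, 2))
    fp = (p * (1 - g_onehot)).sum(dim=(1, 2))   # weighted by alpha
    fn = ((1 - p) * g_onehot).sum(dim=(1, 2))   # weighted by beta (penalises missed pixels)
    ti = tp / (tp + alpha * fp + beta * fn + eps)
    return (1 - ti).sum()
```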
Focal Tversky loss
- Original Paper: 2018-10 A novel focal Tversky loss function with improved attention U-Net for lesion segmentation
- Description:
Using the definition of $TI$, the Focal Tversky loss $\mathcal{L}_{FT}$ is defined as:
$$\mathcal{L}_{FT} = \sum_{c=0}^{C-1}\left(1 - TI_c\right)^{\frac{1}{\gamma}}$$
where $\gamma < 1$ increases the degree of focusing on harder examples. The Focal Tversky loss simplifies to the Tversky loss when $\gamma = 1$.
The optimal value reported was $\gamma = 4/3$, which enhances rather than suppresses the loss of easy examples.
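The corresponding sketch only changes the exponent relative to the Tversky sketch above (the $\alpha/\beta$ weighting is simply carried over); with $\gamma = 4/3$ the exponent $1/\gamma = 3/4 < 1$ enhances rather than suppresses easy examples, as noted above:

```python
def focal_tversky_loss(p, g_onehot, alpha=0.3, beta=0.7, gamma=4 / 3, eps=1e-6):
    """L_FT = sum_c (1 - TI_c)^(1/gamma); reduces to the Tversky loss when gamma = 1."""
    tp = (p * g_onehot).sum(dim=(1, 2))
    fp = (p * (1 - g_onehot)).sum(dim=(1, 2))
    fn = ((1 - p) * g_onehot).sum(dim=(1, 2))
    ti = tp / (tp + alpha * fp + beta * fn + eps)
    return ((1 - ti) ** (1 / gamma)).sum()
```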
Combo loss
- Original Paper: 2018-05 Combo loss: handling input and output imbalance in multi-organ segmentation
- Description:
The Combo loss $\mathcal{L}_{combo}$ is defined as a weighted sum of the DSC and a modified form of the cross entropy loss $\mathcal{L}_{mCE}$:
$$\mathcal{L}_{combo} = \alpha\,\mathcal{L}_{mCE} - (1-\alpha)\cdot DSC$$
where:
$$\mathcal{L}_{mCE} = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\beta_{g_{h,w}}\cdot\log\left(p_{g_{h,w},h,w}\right)$$
Here $\alpha\in[0,1]$ controls the relative contributions of the cross entropy and Dice terms to the loss, and $\beta_0,\cdots,\beta_{C-1}$ control the relative weights assigned to the true positives of each category. The larger $\beta_c$ is relative to the $\beta$ values of the other classes, the larger the weight on the TP of category $c$ and the larger the corresponding penalty on its FN.
This is because $-\beta_c\cdot\log(p_c)$ is decreasing in $p_c$: the larger $p_c$, the smaller the loss. The larger $\beta_c$ is, the larger the weight on $-\log(p_c)$, so the loss pushes the model to output a larger $p_c$ (TP) on pixels with $g = c$ and correspondingly a smaller $p_{\bar{c}}$ (FN), where $p_{\bar{c}}$ denotes the predicted probabilities of classes other than $c$.
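A sketch of the Combo loss under the same conventions; `beta` is a `(C,)` per-class weight vector, and averaging the DSC over classes is my assumption rather than something the paper specifies:

```python
import torch

def combo_loss(p: torch.Tensor, g: torch.Tensor, g_onehot: torch.Tensor,
               beta: torch.Tensor, alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """L_combo = alpha * L_mCE - (1 - alpha) * DSC."""
    # modified cross entropy: per-class weights beta on the true-class log-probabilities
    p_true = p.gather(0, g.unsqueeze(0)).squeeze(0).clamp_min(eps)
    l_mce = -(beta[g] * p_true.log()).mean()

    # soft Dice similarity coefficient, averaged over classes (assumption)
    tp = (p * g_onehot).sum(dim=(1, 2))
    fp = (p * (1 - g_onehot)).sum(dim=(1, 2))
    fn = ((1 - p) * g_onehot).sum(dim=(1, 2))
    dsc = ((2 * tp) / (2 * tp + fp + fn + eps)).mean()

    return alpha * l_mce - (1 - alpha) * dsc
```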
DiceFocal loss
- Original Paper: 2019-02 Boundary-weighted domain adaptive neural network for prostate MR image segmentation
Hybrid Focal loss
- Original Paper: 2021-05 Focus U-Net: a novel dual attention-gated CNN for polyp segmentation during colonoscopy
- Description:
The Hybrid Focal loss $\mathcal{L}_{HF}$ is defined as:
$$\mathcal{L}_{HF} = \lambda\,\mathcal{L}_F + (1-\lambda)\,\mathcal{L}_{FT}$$
where $\lambda\in[0,1]$ determines the relative weighting of the two component loss functions.
Unified Focal loss
- Original Paper: 2021-02 Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation
- Description:
First, the authors replaced $\alpha$ in the Focal loss and $\alpha$ and $\beta$ in the Tversky Index with a common $\delta$ parameter to control output imbalance, and reformulated $\gamma$ to enable simultaneous Focal loss suppression and Focal Tversky loss enhancement, naming these the modified Focal loss $\mathcal{L}_{mF}$ and the modified Focal Tversky loss $\mathcal{L}_{mFT}$, respectively:
$$\mathcal{L}_{mF} = -\frac{1}{H\times W}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\delta_{g_{h,w}}\left(1-p_{g_{h,w},h,w}\right)^{1-\gamma}\cdot\log\left(p_{g_{h,w},h,w}\right)$$
$$\mathcal{L}_{mFT} = \sum_{c=0}^{C-1}\left(1 - mTI_c\right)^{\gamma}$$
where $mTI_c$ is redefined as follows:
$$mTI_c = \frac{TP_c}{TP_c + \delta\cdot FP_c + (1-\delta)\cdot FN_c}$$
$$TP_c = \sum_{h,w} p_{c,h,w}\,g_{c,h,w},\qquad FP_c = \sum_{h,w} p_{c,h,w}\left(1-g_{c,h,w}\right),\qquad FN_c = \sum_{h,w}\left(1-p_{c,h,w}\right)g_{c,h,w}$$
where $g_{c,h,w}$ uses a one-hot encoding scheme for the ground-truth label of pixel $(h,w)$, and $p_{c,h,w}\in[0,1]$ is the predicted value of pixel $(h,w)$ belonging to label $c$:
$$g_{\,\cdot,h,w} = (\underbrace{0}_{0},\cdots,\underbrace{0}_{c-1},\underbrace{1}_{c},\underbrace{0}_{c+1},\cdots,\underbrace{0}_{C-1})$$
The symmetric variant of the Unified Focal loss $\mathcal{L}_{sUF}$ is defined as:
$$\mathcal{L}_{sUF} = \lambda\,\mathcal{L}_{mF} + (1-\lambda)\,\mathcal{L}_{mFT}$$
where $\lambda\in[0,1]$ determines the relative weighting of the two losses.
By grouping functionally equivalent hyperparameters, the six hyperparameters associated with the Hybrid Focal loss are reduced to three, with:
- $\delta$ controlling the relative weighting of positive and negative examples,
- $\gamma$ controlling both suppression of the background class and enhancement of the rare class,
- $\lambda$ determining the weights of the two component losses.
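A sketch of the symmetric Unified Focal loss; for simplicity $\delta$ is treated as a single scalar here (the formula above allows a per-class $\delta_{g_{h,w}}$), and the hyperparameter defaults are placeholders rather than the paper's recommendations:

```python
import torch

def unified_focal_loss(p: torch.Tensor, g: torch.Tensor, g_onehot: torch.Tensor,
                       delta: float = 0.6, gamma: float = 0.5,
                       lam: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """L_sUF = lam * L_mF + (1 - lam) * L_mFT."""
    # modified Focal loss: delta weighting, (1 - gamma) exponent
    p_true = p.gather(0, g.unsqueeze(0)).squeeze(0).clamp_min(eps)
    l_mf = -(delta * (1 - p_true) ** (1 - gamma) * p_true.log()).mean()

    # modified Focal Tversky loss: delta / (1 - delta) weighting, gamma exponent
    tp = (p * g_onehot).sum(dim=(1, 2))
    fp = (p * (1 - g_onehot)).sum(dim=(1, 2))
    fn = ((1 - p) * g_onehot).sum(dim=(1, 2))
    mti = tp / (tp + delta * fp + (1 - delta) * fn + eps)
    l_mft = ((1 - mti) ** gamma).sum()

    return lam * l_mf + (1 - lam) * l_mft
```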
SimCLR
$$\text{mini-batch}\ \{x_1, x_2, \cdots, x_N\} \xrightarrow{\ t\sim\mathcal{T},\ t'\sim\mathcal{T}\ } \{\tilde{x}_1, \tilde{x}_{N+1}, \tilde{x}_2, \tilde{x}_{N+2}, \cdots, \tilde{x}_N, \tilde{x}_{2N}\} \xrightarrow{\ f+g\ (\text{ResNet}+\text{MLP})\ } \{z_1, z_{N+1}, z_2, z_{N+2}, \cdots, z_N, z_{2N}\}$$
Given a positive pair $(i, N+i)$, we treat the other $2(N-1)$ augmented examples within a mini-batch as negative examples.
$$l_{i,N+i} = -\log\frac{\exp\!\left(\mathrm{sim}(z_i,z_{N+i})/\tau\right)}{\sum_{k=1,\,k\neq i}^{2N}\exp\!\left(\mathrm{sim}(z_i,z_k)/\tau\right)} = -\log\frac{m(z_i,z_{N+i})}{\begin{matrix}m(z_i,z_1)+\cdots+0+\cdots+m(z_i,z_N)\ +\\ m(z_i,z_{N+1})+\cdots+m(z_i,z_{N+i})+\cdots+m(z_i,z_{2N})\end{matrix}}$$
(the $0$ marks the excluded term $k = i$)
$$m(z_i,z_j) = \exp\!\left(\mathrm{sim}(z_i,z_j)/\tau\right),\qquad j = 1,\cdots,2N$$
cosine similarity:
$$\mathrm{sim}(z_i,z_j) = \frac{z_i\cdot z_j}{\lVert z_i\rVert\,\lVert z_j\rVert} = \cos\!\left(\langle z_i, z_j\rangle\right)$$
Note: $m(z_i,z_j) = m(z_j,z_i)$, but $l_{i,j} \neq l_{j,i}$.
They have the same numerator, but their denominators are quite different.
$$l_{N+i,i} = -\log\frac{\exp\!\left(\mathrm{sim}(z_{N+i},z_i)/\tau\right)}{\sum_{k=1,\,k\neq N+i}^{2N}\exp\!\left(\mathrm{sim}(z_{N+i},z_k)/\tau\right)} = -\log\frac{m(z_{N+i},z_i)}{\begin{matrix}m(z_{N+i},z_1)+\cdots+m(z_{N+i},z_i)+\cdots+m(z_{N+i},z_N)\ +\\ m(z_{N+i},z_{N+1})+\cdots+0+\cdots+m(z_{N+i},z_{2N})\end{matrix}}$$
The final loss is computed across all positive pairs, both (i,j) and (j,i), in a mini-batch.
$$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\left(l_{i,N+i} + l_{N+i,i}\right)$$
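A compact NT-Xent sketch, assuming the $2N$ projections are stacked so that rows $i$ and $N+i$ are the two views of the same image; masking the diagonal removes the $k=i$ term from every denominator, and the row-wise cross entropy then averages all $2N$ of the $l$ terms:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """L = (1/2N) * sum_i (l_{i,N+i} + l_{N+i,i}) for projections z of shape (2N, d)."""
    z = F.normalize(z, dim=1)                   # unit vectors, so dot product = cosine similarity
    sim = z @ z.t() / tau                       # (2N, 2N) matrix of sim(z_i, z_k) / tau
    n2 = z.shape[0]
    mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))  # exclude the k = i term from every denominator
    n = n2 // 2
    # the positive for row i is row N+i, and vice versa
    targets = torch.cat([torch.arange(n, n2, device=z.device),
                         torch.arange(0, n, device=z.device)])
    return F.cross_entropy(sim, targets)        # mean over 2N rows = (1/2N) * sum of all l terms
```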
Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google’s own TensorFlow software. Google began using TPUs internally in 2015, and in 2018 made them available for third party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.