Prepare the Synapse dataset

Image Sequentialization
The input $x$ is an image of size $224\times 224$.
Reshape the input $x \in \R^{H\times W\times C}$ into a sequence of flattened 2D patches $\{x_p^1, x_p^2, \cdots, x_p^N\}$, $x_p^i \in \R^{1\times P^2C}$, $N=\frac{HWC}{P^2C}=\frac{HW}{P^2}$, where $P$ denotes the patch size and each patch is of size $P\times P$.
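The reshaping above can be sketched in NumPy (a minimal sketch assuming $H=W=224$, $C=3$, $P=16$, so $N=196$; the concrete values are illustrative):

```python
import numpy as np

# Image sequentialization: split a H x W x C image into N non-overlapping
# P x P patches, each flattened to a vector of length P^2 * C.
H, W, C, P = 224, 224, 3, 16
x = np.random.rand(H, W, C)

patches = x.reshape(H // P, P, W // P, P, C)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)     # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * C)       # (N, P^2 * C)

N = (H * W) // (P * P)
print(patches.shape)  # (196, 768)
```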

Patch Embedding:

$$z_0 = \begin{bmatrix}x_p^1E \\ x_p^2E \\ \vdots \\ x_p^NE\end{bmatrix} + E_{pos}$$

where $E\in \R^{P^2C\times D}$ is the patch embedding projection matrix and $E_{pos}\in \R^{N\times D} = \R^{\frac{HW}{P^2}\times D}$ denotes the position embedding.
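The embedding is a single matrix multiply plus a learned positional offset; a minimal NumPy sketch (hidden size $D=768$ is the ViT-Base value, assumed here):

```python
import numpy as np

# Patch embedding: z_0 = [x_p^i E] + E_pos, with random stand-ins for the
# learned parameters E and E_pos.
N, P, C, D = 196, 16, 3, 768
rng = np.random.default_rng(0)

x_p = rng.standard_normal((N, P * P * C))  # flattened patches
E = rng.standard_normal((P * P * C, D))    # projection, R^{P^2 C x D}
E_pos = rng.standard_normal((N, D))        # position embedding, R^{N x D}

z0 = x_p @ E + E_pos                       # (N, D)
print(z0.shape)  # (196, 768)
```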

Pure Transformer Encoder:

$$\begin{split} z'_l &= MSA(LN(z_{l-1})) + z_{l-1}\\ z_l &= MLP(LN(z'_l)) + z'_l \end{split}\\ l = 1,2,\cdots,L\\ \Downarrow\\ z_L \in \R^{\frac{HW}{P^2}\times D}$$

where MSA denotes Multihead Self-Attention, LN denotes Layer Normalization, and MLP denotes Multi-Layer Perceptron.
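One pre-norm encoder layer can be sketched in NumPy as below. This is a simplified sketch: LN has no learned scale/shift, the MLP uses ReLU (ViT uses GELU), and the sizes $N=196$, $D=64$, $h=4$ heads are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN over the feature dimension (no learned affine, for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(z, Wq, Wk, Wv, Wo, h):
    # Multihead self-attention with h heads of size D // h.
    N, D = z.shape
    d = D // h
    q = (z @ Wq).reshape(N, h, d).transpose(1, 0, 2)
    k = (z @ Wk).reshape(N, h, d).transpose(1, 0, 2)
    v = (z @ Wv).reshape(N, h, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (h, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)
    return out @ Wo

def encoder_layer(z, params, h=4):
    Wq, Wk, Wv, Wo, W1, W2 = params
    z = z + msa(layer_norm(z), Wq, Wk, Wv, Wo, h)    # z'_l = MSA(LN(z)) + z
    z = z + np.maximum(layer_norm(z) @ W1, 0) @ W2   # z_l = MLP(LN(z')) + z'
    return z

rng = np.random.default_rng(0)
N, D = 196, 64
params = [rng.standard_normal(s) * 0.02
          for s in [(D, D)] * 4 + [(D, 4 * D), (4 * D, D)]]
z = rng.standard_normal((N, D))
print(encoder_layer(z, params).shape)  # (196, 64)
```

Stacking this layer $L$ times yields $z_L$ with the same $(N, D)$ shape, matching $z_L \in \R^{\frac{HW}{P^2}\times D}$ above.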

CNN-Transformer Hybrid as Encoder:

CNN feature extractor + Transformer

$$\text{raw image} \xrightarrow[\text{1st feature extractor}]{\text{CNN}} \text{feature map} \xrightarrow[\text{patch embedding}]{xE+E_{pos}} z_0 \xrightarrow[\text{2nd feature extractor}]{\text{Transformer}} z_L$$
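A shape-level sketch of this pipeline, using repeated 2×2 average pooling as a stand-in for the CNN backbone (TransUNet uses a real CNN such as ResNet-50; the 1/16 resolution, patch size 1 on the feature map, and $D=768$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 224, 224, 3, 768
img = rng.random((H, W, C))

# "CNN" stand-in: four 2x downsampling stages -> 1/16-resolution feature map.
feat = img
for _ in range(4):
    h, w, c = feat.shape
    feat = feat.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
# feat: (14, 14, 3); a real backbone would also widen the channels.

# Patch embedding on the feature map with patch size 1:
N = feat.shape[0] * feat.shape[1]
x = feat.reshape(N, -1)                 # (196, C_feat)
E = rng.standard_normal((x.shape[1], D))
E_pos = rng.standard_normal((N, D))
z0 = x @ E + E_pos                      # input to the Transformer
print(z0.shape)  # (196, 768)
```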

Cascaded Upsampler:

TransUNet = CNN-Transformer Hybrid Encoder + Cascaded Upsampler (CUP). The CUP reshapes $z_L$ back into a $\frac{H}{P}\times\frac{W}{P}\times D$ feature map, then decodes it to full resolution through cascaded upsampling blocks, with U-Net-style skip connections to the intermediate CNN feature maps.
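A shape-level sketch of the decoding path (the decoder channel widths are assumptions, a random matrix multiply stands in for each conv, and the skip concatenations are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768
z_L = rng.standard_normal((N, D))

feat = z_L.reshape(14, 14, D)        # back to (H/P, W/P, D)
channels = [256, 128, 64, 16]        # decoder widths (assumed)
for c_out in channels:
    h, w, c_in = feat.shape
    feat = feat.repeat(2, axis=0).repeat(2, axis=1)  # nearest 2x upsample
    W_mix = rng.standard_normal((c_in, c_out)) * 0.02
    feat = np.maximum(feat @ W_mix, 0)               # "conv" + ReLU stand-in
print(feat.shape)  # (224, 224, 16): per-pixel features for the seg head
```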

Evaluation

Sørensen–Dice coefficient (DSC): $DSC(X, Y) = \frac{2|X\cap Y|}{|X|+|Y|}$, where $X$ is the predicted segmentation mask and $Y$ the ground-truth mask.
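A minimal sketch of the metric for binary masks (the `eps` term guarding against empty masks is a common convention, not from the source):

```python
import numpy as np

def dice(pred, target, eps=1e-6):
    # DSC = 2 |X ∩ Y| / (|X| + |Y|) for binary masks.
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice(a, b), 4))  # 2*2 / (3+3) ≈ 0.6667
```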