Prepare the Synapse dataset

Image Sequentialization
The input $x$ is an image of size $224\times 224$.
Reshape the input $x \in \R^{H\times W\times C}$ into a sequence of flattened 2D patches $\{x_p^1, x_p^2, \cdots, x_p^N\}$, $x_p^i \in \R^{1\times P^2C}$, $N=\frac{HWC}{P^2C}=\frac{HW}{P^2}$, where $P$ denotes the patch size and each patch is of size $P\times P$.
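The reshaping above can be sketched in NumPy (a minimal sketch assuming $H=W=224$, $C=3$, $P=16$, so $N=196$; the concrete values are illustrative):

```python
import numpy as np

# Image sequentialization: split a H x W x C image into N non-overlapping
# P x P patches, each flattened to a vector of length P^2 * C.
H, W, C, P = 224, 224, 3, 16
x = np.random.rand(H, W, C)

patches = x.reshape(H // P, P, W // P, P, C)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)     # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * C)       # (N, P^2 * C)

N = (H * W) // (P * P)
print(patches.shape)  # (196, 768)
```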

Patch Embedding:

$$z_0 = \begin{bmatrix}x_p^1E \\ x_p^2E \\ \vdots \\ x_p^NE\end{bmatrix} + E_{pos}$$

where $E\in \R^{P^2C\times D}$ is the patch embedding projection matrix and $E_{pos}\in \R^{N\times D} = \R^{\frac{HW}{P^2}\times D}$ denotes the position embedding.
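The embedding is a single matrix multiply plus a learned positional offset; a minimal NumPy sketch (hidden size $D=768$ is the ViT-Base value, assumed here):

```python
import numpy as np

# Patch embedding: z_0 = [x_p^i E] + E_pos, with random stand-ins for the
# learned parameters E and E_pos.
N, P, C, D = 196, 16, 3, 768
rng = np.random.default_rng(0)

x_p = rng.standard_normal((N, P * P * C))  # flattened patches
E = rng.standard_normal((P * P * C, D))    # projection, R^{P^2 C x D}
E_pos = rng.standard_normal((N, D))        # position embedding, R^{N x D}

z0 = x_p @ E + E_pos                       # (N, D)
print(z0.shape)  # (196, 768)
```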

Pure Transformer Encoder:

$$\begin{split} z'_l &= MSA(LN(z_{l-1})) + z_{l-1}\\ z_l &= MLP(LN(z'_l)) + z'_l \end{split}\\ l = 1,2,\cdots,L\\ \Downarrow\\ z_L \in \R^{\frac{HW}{P^2}\times D}$$

where MSA denotes Multihead Self-Attention, LN denotes Layer Normalization, and MLP denotes Multi-Layer Perceptron.
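One pre-norm encoder layer can be sketched in NumPy as below. This is a simplified sketch: LN has no learned scale/shift, the MLP uses ReLU (ViT uses GELU), and the sizes $N=196$, $D=64$, $h=4$ heads are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN over the feature dimension (no learned affine, for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(z, Wq, Wk, Wv, Wo, h):
    # Multihead self-attention with h heads of size D // h.
    N, D = z.shape
    d = D // h
    q = (z @ Wq).reshape(N, h, d).transpose(1, 0, 2)
    k = (z @ Wk).reshape(N, h, d).transpose(1, 0, 2)
    v = (z @ Wv).reshape(N, h, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (h, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)
    return out @ Wo

def encoder_layer(z, params, h=4):
    Wq, Wk, Wv, Wo, W1, W2 = params
    z = z + msa(layer_norm(z), Wq, Wk, Wv, Wo, h)    # z'_l = MSA(LN(z)) + z
    z = z + np.maximum(layer_norm(z) @ W1, 0) @ W2   # z_l = MLP(LN(z')) + z'
    return z

rng = np.random.default_rng(0)
N, D = 196, 64
params = [rng.standard_normal(s) * 0.02
          for s in [(D, D)] * 4 + [(D, 4 * D), (4 * D, D)]]
z = rng.standard_normal((N, D))
print(encoder_layer(z, params).shape)  # (196, 64)
```

Stacking this layer $L$ times yields $z_L$ with the same $(N, D)$ shape, matching $z_L \in \R^{\frac{HW}{P^2}\times D}$ above.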

CNN-Transformer Hybrid as Encoder:

CNN feature extractor + Transformer

$$\text{raw image} \xrightarrow[\text{1st feature extractor}]{\text{CNN}} \text{feature map} \xrightarrow[\text{patch embedding}]{xE+E_{pos}} z_0 \xrightarrow[\text{2nd feature extractor}]{\text{Transformer}} z_L$$
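A shape-level sketch of this pipeline, using repeated 2×2 average pooling as a stand-in for the CNN backbone (TransUNet uses a real CNN such as ResNet-50; the 1/16 resolution, patch size 1 on the feature map, and $D=768$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 224, 224, 3, 768
img = rng.random((H, W, C))

# "CNN" stand-in: four 2x downsampling stages -> 1/16-resolution feature map.
feat = img
for _ in range(4):
    h, w, c = feat.shape
    feat = feat.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
# feat: (14, 14, 3); a real backbone would also widen the channels.

# Patch embedding on the feature map with patch size 1:
N = feat.shape[0] * feat.shape[1]
x = feat.reshape(N, -1)                 # (196, C_feat)
E = rng.standard_normal((x.shape[1], D))
E_pos = rng.standard_normal((N, D))
z0 = x @ E + E_pos                      # input to the Transformer
print(z0.shape)  # (196, 768)
```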

Cascaded Upsampler:

TransUNet = CNN-Transformer Hybrid Encoder + Cascaded Upsampler (CUP). The CUP reshapes $z_L$ back into a $\frac{H}{P}\times\frac{W}{P}\times D$ feature map, then decodes it to full resolution through cascaded upsampling blocks, with U-Net-style skip connections to the intermediate CNN feature maps.
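A shape-level sketch of the decoding path (the decoder channel widths are assumptions, a random matrix multiply stands in for each conv, and the skip concatenations are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768
z_L = rng.standard_normal((N, D))

feat = z_L.reshape(14, 14, D)        # back to (H/P, W/P, D)
channels = [256, 128, 64, 16]        # decoder widths (assumed)
for c_out in channels:
    h, w, c_in = feat.shape
    feat = feat.repeat(2, axis=0).repeat(2, axis=1)  # nearest 2x upsample
    W_mix = rng.standard_normal((c_in, c_out)) * 0.02
    feat = np.maximum(feat @ W_mix, 0)               # "conv" + ReLU stand-in
print(feat.shape)  # (224, 224, 16): per-pixel features for the seg head
```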

Evaluation

Sørensen–Dice coefficient (DSC): $DSC(X, Y) = \frac{2|X\cap Y|}{|X|+|Y|}$, where $X$ is the predicted segmentation mask and $Y$ the ground-truth mask.
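A minimal sketch of the metric for binary masks (the `eps` term guarding against empty masks is a common convention, not from the source):

```python
import numpy as np

def dice(pred, target, eps=1e-6):
    # DSC = 2 |X ∩ Y| / (|X| + |Y|) for binary masks.
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice(a, b), 4))  # 2*2 / (3+3) ≈ 0.6667
```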