References
高性能计算方向面试问题总结 | zhihu
C++
Atomic Operations
【C++11 多线程】原子操作(六)
Deep Learning Modules
How to Understand Overfitting
Think of the dataset as constraints and the model parameters as degrees of freedom: the fewer the constraints and the more the degrees of freedom, the harder it is to pin down one specific model. For example, suppose we only have two data points and want a linear model. Fitting them with the two-parameter model ax + b is exactly determined; fitting them with the three-parameter model ax^2 + bx + c admits many solutions, and there is no guarantee we recover the linear case a = 0. The remedies are, on one hand, more and more diverse data, and on the other hand, constraining (regularizing) the model so that it is biased toward the intended, simpler model.
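A minimal sketch of the two-point example above (the specific numbers and the particular parabola are made up for illustration): a line and a genuinely non-linear parabola both pass through the same two points, so two points alone cannot pin down a three-parameter model.

```python
import torch

# Two data points generated by the "ideal" line y = 2x + 1
xs = torch.tensor([0.0, 1.0])
ys = 2 * xs + 1                              # tensor([1., 3.])

line = lambda x: 2 * x + 1                   # ax + b with a=2, b=1
parabola = lambda x: 3 * x ** 2 - 1 * x + 1  # ax^2 + bx + c with a != 0

# Both models fit the two points exactly, so the data cannot tell them apart.
print(torch.allclose(line(xs), ys), torch.allclose(parabola(xs), ys))  # True True
```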
On the Number of Parameter Updates
During training, the number of parameter updates a model needs depends on the "distance" between the current parameters and the global/local optimum. For the same model, on a simple dataset the loss may well keep decreasing after a very large number of updates, because the global optimum has not yet been reached; on a complex dataset the loss may plateau after only a few updates, because training has fallen into a local optimum.
How to Understand Activation Functions
Why do we need activation functions?
To introduce non-linearity (e.g. the non-linear activation Sigmoid); without it, stacked linear layers collapse into a single linear map.
To express conditional branching as a function, i.e. an if-else effect (e.g. the piecewise-linear ReLU, or the unit step function). A small sketch of both points follows this list.
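A quick illustration of both considerations (the tensors and shapes below are arbitrary examples, not from the original notes): without an activation, two Linear layers collapse into one linear map, and ReLU is exactly an element-wise if-else.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 3)

# 1) Non-linearity: Linear -> Linear with no activation is still a single linear map
W1, W2 = torch.randn(4, 3), torch.randn(2, 4)
stacked = x @ W1.T @ W2.T
collapsed = x @ (W2 @ W1).T
print(torch.allclose(stacked, collapsed, atol=1e-5))  # True

# 2) Branching: ReLU(z) is the element-wise "if z > 0 take z, else take 0"
z = torch.linspace(-3, 3, 7)
print(torch.equal(nn.ReLU()(z), torch.where(z > 0, z, torch.zeros_like(z))))  # True
```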
Batch Normalization
```python
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Inference: normalize with the running (moving) statistics
    if not torch.is_grad_enabled():
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    # Training: normalize with the statistics of the current mini-batch
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # fully-connected layer: statistics over the batch dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # convolutional layer: per-channel statistics over (B, H, W)
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # update the running statistics with exponential moving averages
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # learnable scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            eps=1e-5, momentum=0.9)
        return Y

if __name__ == "__main__":
    B, C, H, W = 8, 3, 224, 224
    input = torch.rand(B, C, H, W)
    batch_norm2d = BatchNorm(num_features=3, num_dims=4)
    output = batch_norm2d(input)
    print(output.shape)  # torch.Size([8, 3, 224, 224])
```
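A quick sanity check (assuming the default affine initialization gamma=1, beta=0 and eps=1e-5 on both sides): in training mode the BatchNorm above should match nn.BatchNorm2d, since both normalize with the statistics of the current batch.

```python
import torch
from torch import nn

x = torch.rand(8, 3, 16, 16)
custom = BatchNorm(num_features=3, num_dims=4)   # the class defined above
builtin = nn.BatchNorm2d(3, eps=1e-5)
custom.train(); builtin.train()
print(torch.allclose(custom(x), builtin(x), atol=1e-5))  # expected: True
```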
Layer Normalization
For an input of shape (B, C, H*W), LN normalizes over the (C, H*W) dimensions, i.e. mean(dim=(-2,-1)) and std(dim=(-2,-1)).
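A minimal sketch of that statement (the shapes and eps=1e-5 below are assumptions matching PyTorch defaults), comparing a manual normalization over the last two dimensions with nn.LayerNorm:

```python
import torch
from torch import nn

B, C, HW = 8, 3, 224 * 224
x = torch.rand(B, C, HW)

# manual LN over (C, H*W): per-sample mean/variance across the last two dims
mean = x.mean(dim=(-2, -1), keepdim=True)
var = x.var(dim=(-2, -1), keepdim=True, unbiased=False)
y_manual = (x - mean) / torch.sqrt(var + 1e-5)

# nn.LayerNorm with normalized_shape covering the last two dims (no affine params)
y_builtin = nn.LayerNorm([C, HW], elementwise_affine=False)(x)
print(torch.allclose(y_manual, y_builtin, atol=1e-5))  # expected: True
```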
Dropout
Dropout is only applied during training and is disabled at inference time. It is a way of perturbing the input or intermediate activations: some elements are randomly zeroed out, which makes the model more robust.
In one forward pass, dropout randomly zeroes some intermediate activations; the weights attached to those zeroed elements receive zero gradient in that backward pass and are not updated, while the weights attached to the surviving elements get relatively more training. Letting the weights that actually do the work receive more of the training signal is one explanation of why dropout helps.
```python
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p: float = 0.5) -> None:
        super().__init__()
        if p < 0 or p > 1:
            raise ValueError(f"dropout probability has to be between 0 and 1, but got {p}")
        self.p = p

    def forward(self, x):
        # dropout is a no-op at inference time
        if self.training is False:
            return x
        if self.p == 1:
            return torch.zeros_like(x)
        if self.p == 0:
            return x
        # each element survives with probability 1 - p
        mask = (torch.rand(x.shape) > self.p).float()
        # why divide by (1 - p)? to keep the expectation unchanged,
        # i.e. E[dropout(x)] == x, so the activation scale matches inference
        return mask * x / (1.0 - self.p)

if __name__ == "__main__":
    B, T, C = 1, 3, 4
    input = torch.arange(B * T * C).reshape(B, T, C)
    print(input)

    model = Dropout(p=0.5)
    model.train()
    print(f"self.training: {model.training}")
    output_training = model(input)
    print(output_training)

    model.eval()
    print(f"self.training: {model.training}")
    output_evaluate = model(input)
    print(output_evaluate)

''' Example output:
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
self.training: True
tensor([[[ 0.,  2.,  0.,  6.],
         [ 0.,  0., 12., 14.],
         [ 0.,  0., 20.,  0.]]])
self.training: False
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
'''
```
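To see the gradient claim from the paragraph above in action, here is a toy check using the Dropout class just defined (the tensor values are arbitrary): dropped positions receive exactly zero gradient in this pass, while surviving positions receive the scaled gradient 1/(1-p).

```python
import torch

x = torch.ones(1, 3, 4, requires_grad=True)
drop = Dropout(p=0.5)   # the class defined above
drop.train()
y = drop(x)
y.sum().backward()
print(y)       # zeros at the dropped positions
print(x.grad)  # 0 at the dropped positions, 1 / (1 - p) = 2 elsewhere
```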
Softmax
Implement a Softmax, defined as
$$
\text{softmax}([x_1,\cdots,x_n]) = \left[\frac{\exp(x_1)}{\sum_i^n\exp(x_i)}, \cdots, \frac{\exp(x_n)}{\sum_i^n\exp(x_i)}\right]
$$
In the actual implementation, however, we subtract $\max(x_1,\cdots,x_n)$ from $[x_1,\cdots,x_n]$ before taking $\exp$. Because Softmax is shift-invariant, the result is unchanged:
$$
\frac{\exp(x_i - \max)}{\sum_i^n\exp(x_i - \max)} =
\frac{\exp(x_i)/\exp(\max)}{\sum_i^n\exp(x_i)/\exp(\max)} =
\frac{\exp(x_i)}{\sum_i^n\exp(x_i)}
$$
The benefit is overflow prevention. For large $x$, $\exp(x)$ can overflow. Subtracting the $\max$ shifts every $x$ into $(-\infty, 0]$, so $\exp(x)$ lies in $(0, 1]$ and no overflow occurs.
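A tiny demonstration of the overflow issue (the input values are made up for illustration):

```python
import torch

x = torch.tensor([1000.0, 1000.1])
print(torch.exp(x))             # tensor([inf, inf]) -> a naive softmax would return nan
print(torch.exp(x - x.max()))   # tensor([0.9048, 1.0000]) stays finite
print(torch.softmax(x, dim=0))  # tensor([0.4750, 0.5250]) the built-in is numerically stable
```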
```python
import torch
from torch import nn

class Softmax(nn.Module):
    def __init__(self, dim=0):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        # subtract the per-slice max for numerical stability (shift invariance)
        x_max = torch.max(x, dim=self.dim, keepdim=True)[0]
        x_exp = torch.exp(x - x_max)
        # normalize by the sum of the exponentials along the same dim
        x_exp_sum = torch.sum(x_exp, dim=self.dim, keepdim=True)
        output = x_exp / x_exp_sum
        return output

if __name__ == "__main__":
    B, T, C = 8, 192, 768
    input = torch.rand(B, T, C)
    output = Softmax(dim=-1)(input)
    print(output.shape)  # torch.Size([8, 192, 768])
```
Self-Attention
```python
import torch
from torch import nn

'''
x.shape: [n, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
attention = softmax((Q @ K^T) / sqrt(dim_k)) @ V
'''

class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v):
        super(SelfAttention, self).__init__()
        assert dim_k == dim_q
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)
        self.norm = dim_k ** (1 / 2)  # scaling factor sqrt(dim_k)

    def forward(self, x):
        '''x: [..., n, dim_in]'''
        assert x.shape[-1] == self.dim_in
        q = self.linear_q(x)  # [..., n, dim_q]
        k = self.linear_k(x)  # [..., n, dim_k]
        v = self.linear_v(x)  # [..., n, dim_v]
        attention = torch.matmul(q, k.transpose(-2, -1)) / self.norm  # [..., n, n]
        attention = nn.Softmax(dim=-1)(attention)
        attention = torch.matmul(attention, v)  # [..., n, dim_v]
        return attention

if __name__ == "__main__":
    B, T, C = 8, 12, 4
    input = torch.rand(B, T, C)
    attention = SelfAttention(dim_in=C, dim_q=16, dim_k=16, dim_v=32)
    output = attention(input)
    print(output.shape)  # torch.Size([8, 12, 32])
```
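A quick cross-check sketch (it assumes PyTorch 2.0+, which provides F.scaled_dot_product_attention computing the same softmax(QK^T/sqrt(d))V with the default scale 1/sqrt(dim_q)): feeding the built-in the same projections should reproduce the module above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 12, 4)
att = SelfAttention(dim_in=4, dim_q=16, dim_k=16, dim_v=32)  # the class above
q, k, v = att.linear_q(x), att.linear_k(x), att.linear_v(x)
ref = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(att(x), ref, atol=1e-5))  # expected: True
```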
MultiHead-Attention
```python
import torch
from torch import nn

'''
x.shape: [B, T, C]
Q = x @ Wq; K = x @ Wk; V = x @ Wv
Q.shape: [B, T, dim_q]; K.shape: [B, T, dim_k]; V.shape: [B, T, dim_v]
Q = [Q1,...,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
Q1-Qh: [B, T, dim_q/h]; K1-Kh: [B, T, dim_k/h]; V1-Vh: [B, T, dim_v/h]
output = concat[attention(Q1,K1,V1), ..., attention(Qh,Kh,Vh)]
attention(Q1,K1,V1)-attention(Qh,Kh,Vh): [B, T, dim_v/h]
'''

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        assert dim_q == dim_k
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.num_heads = num_heads
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)
        self.norm_fact = (dim_k // num_heads) ** (1 / 2)  # sqrt of the per-head dim

    def forward(self, x):
        '''x.shape: [B, T, dim_in]'''
        B, T, dim_in = x.shape
        assert dim_in == self.dim_in
        dim_q, dim_k, dim_v = self.dim_q, self.dim_k, self.dim_v
        heads = self.num_heads

        # project, then split into heads: [B, T, dim] -> [B, heads, T, dim/heads]
        q = self.linear_q(x).reshape(B, T, heads, dim_q // heads).transpose(1, 2)
        k = self.linear_k(x).reshape(B, T, heads, dim_k // heads).transpose(1, 2)
        v = self.linear_v(x).reshape(B, T, heads, dim_v // heads).transpose(1, 2)

        attention = torch.matmul(q, k.transpose(-2, -1)) / self.norm_fact  # [B, heads, T, T]
        attention = nn.Softmax(dim=-1)(attention)
        attention = torch.matmul(attention, v)  # [B, heads, T, dim_v/heads]
        # merge the heads back: [B, heads, T, dim_v/heads] -> [B, T, dim_v]
        attention = attention.transpose(1, 2).reshape(B, T, dim_v)
        return attention

if __name__ == "__main__":
    B, T, C = 8, 12, 4
    input = torch.rand(B, T, C)
    multihead = MultiHeadSelfAttention(dim_in=C, dim_q=16, dim_k=16, dim_v=32, num_heads=8)
    output = multihead(input)
    print(output.shape)  # torch.Size([8, 12, 32])
```
PatchEmbedding in ViT
Vision Transformer uses the Patch Embedding operation to turn a CxHxW image into a sequence of "tokens" of shape [(H//P)*(W//P), C*P*P], which can then be fed into MultiHeadSelfAttention as usual.
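The shape bookkeeping becomes clearer with a small sketch (the Conv2d below is a stand-in projection with bias disabled, not the module from the implementation that follows): a stride-P, kernel-P convolution is exactly "cut the image into P x P patches and apply one shared Linear layer to each flattened patch".

```python
import torch
from torch import nn

B, C, H, W, P = 2, 3, 8, 8, 4
x = torch.rand(B, C, H, W)
proj = nn.Conv2d(C, C * P * P, kernel_size=P, stride=P, bias=False)

# convolution view: [B, C, H, W] -> [B, C*P*P, H//P, W//P] -> [B, N, C*P*P]
out_conv = proj(x).flatten(2).transpose(1, 2)

# patch view: cut into patches, flatten each, apply the same weights as a Linear layer
patches = x.unfold(2, P, P).unfold(3, P, P)                           # [B, C, H//P, W//P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)  # [B, N, C*P*P]
out_linear = patches @ proj.weight.flatten(1).T

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # expected: True
```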
```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    '''2D image to token embedding'''
    def __init__(self, channel, height, width, patch_size):
        super().__init__()
        self.C = channel
        self.H = height
        self.W = width
        self.P = patch_size
        self.embed_dim = channel * patch_size * patch_size
        '''Out = ⌊(In - Kernel + 2*Padding) / Stride⌋ + 1'''
        '''[B, C, H, W] -project-> [B, C*P*P, H//P, W//P]'''
        self.proj = nn.Conv2d(in_channels=channel, out_channels=self.embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.H and W == self.W, \
            f"input shape {H}x{W} doesn't match model shape {self.H}x{self.W}"
        x = self.proj(x)             # [B, C*P*P, H//P, W//P]
        x = x.flatten(start_dim=-2)  # [B, C*P*P, (H//P)*(W//P)]
        x = x.transpose(-2, -1)      # [B, (H//P)*(W//P), C*P*P]
        return x

if __name__ == "__main__":
    B, C, H, W, P = 8, 3, 224, 224, 16
    input = torch.rand(B, C, H, W)
    '''
    (H//P)*(W//P): the number of patches, C*P*P: the embedding dimension of a patch
    [B, C, H, W] -> [B, (H//P)*(W//P), C*P*P]
    [8, 3, 224, 224] -> [8, 14*14, 3*16*16] = [8, 196, 768]
    '''
    embedding = PatchEmbedding(C, H, W, P)
    output = embedding(input)
    print(output.shape)  # torch.Size([8, 196, 768])
```
CUDA
reduction
```cpp
#include <stdio.h>
#include <cuda.h>

// CPU reference implementation
int sum(const int* idata, int n) {
    int result = 0;
    for (int i = 0; i < n; i++) {
        result += idata[i];
    }
    return result;
}

// pairwise tree reduction in place: thread tid adds idata[tid + stride]
// into idata[tid] when tid is a multiple of 2 * stride
__global__ void reduction1(int *idata, int *odata, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int stride = 1; tid + stride < n; stride *= 2) {
        if ((tid % (2 * stride)) == 0) {
            idata[tid] = idata[tid] + idata[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) *odata = idata[0];
}

// same tree shape, re-indexed so that the active threads are contiguous
__global__ void reduction2(int *idata, int *odata, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int stride = 1; stride < n; stride *= 2) {
        int index = 2 * stride * tid;
        if (index + stride < n)
            idata[index] += idata[index + stride];
        __syncthreads();
    }
    if (tid == 0) *odata = idata[0];
}

// sequential addressing: the lower half of the active range adds the upper half,
// and the stride halves each iteration
__global__ void reduction3(int *idata, int *odata, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        if (tid < stride) {
            atomicAdd(&idata[tid], idata[tid + stride]);
        }
        __syncthreads();
    }
    if (tid == 0) *odata = idata[0];
}

int main() {
    int size = 1 << 10;
    int bytes = size * sizeof(int);
    int* h_idata = (int*)malloc(bytes);
    int* h_odata = (int*)malloc(sizeof(int));
    for (int i = 0; i < size; i++) {
        h_idata[i] = 1;
    }

    int* d_idata = nullptr;
    int* d_odata = nullptr;
    cudaMalloc((void**)&d_idata, bytes);
    cudaMalloc((void**)&d_odata, sizeof(int));
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    int blocksize = 64;
    dim3 block(blocksize);
    dim3 grid((size + block.x - 1) / block.x);
    reduction3<<<grid, block>>>(d_idata, d_odata, size);
    cudaDeviceSynchronize();

    cudaMemcpy(h_odata, d_odata, sizeof(int), cudaMemcpyDeviceToHost);
    printf("output from gpu: %d\n", *h_odata);
    printf("output from cpu: %d\n", sum(h_idata, size));
    return 0;
}
```
```bash
~/$ nvcc ./reduction.cu -o reduction
~/$ ./reduction
```
reduction3 is buggy: across runs it prints 960, 992, or 1024. The reason is that __syncthreads() only synchronizes the threads within a single block, while the stride loop reads idata[tid + stride] values written by threads of other blocks, so the result depends on how the blocks happen to be scheduled. (reduction1 and reduction2 rely on the same cross-block reads once the stride exceeds the block size, so strictly speaking they are not guaranteed correct either; a robust fix is to reduce within each block in shared memory and then combine the per-block partial sums, e.g. with an atomicAdd on the output.)
convolution
CUDA笔记-卷积计算
CUDA 中的卷积运算
```cpp
#include <stdio.h>
#include <cuda.h>

// CPU reference: 1D convolution; border elements where the kernel does not fit stay 0
void conv1d_cpu(float* input, float* output, int size, float* kernal, int kernal_size) {
    for (int i = 0; i < size; i++) {
        output[i] = 0;
        int start = i - kernal_size / 2;
        int end = start + kernal_size;
        if (start < 0 || end > size) continue;
        for (int j = start; j < end; j++) {
            output[i] += input[j] * kernal[j - start];
        }
    }
}

// version 1: input and kernel both read from global memory
__global__ void conv1d_gpu1(float *input, float *output, int size, float *kernal, int kernal_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = 0;
    int start = tid - kernal_size / 2;
    int end = start + kernal_size;
    if (start < 0 || end > size) return;
    for (int i = start; i < end; i++) {
        output[tid] += input[i] * kernal[i - start];
    }
}

// version 2: kernel placed in constant memory (read-only, broadcast to all threads)
#define KERNAL_SIZE 5
__constant__ float KERNAL[KERNAL_SIZE];

__global__ void conv1d_gpu2(float *input, float *output, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = 0;
    int start = tid - KERNAL_SIZE / 2;
    int end = start + KERNAL_SIZE;
    if (start < 0 || end > size) return;
    for (int i = start; i < end; i++) {
        output[tid] += input[i] * KERNAL[i - start];
    }
}

// version 3: input staged in shared memory. For simplicity every block copies the
// whole input (SIZE elements) and synchronizes before reading; a production kernel
// would stage only its own tile plus a halo of kernal_size/2 elements on each side.
#define SIZE 1024
__global__ void conv1d_gpu3(float *input, float *output, int size, float *kernal, int kernal_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float shared_input[SIZE];
    for (int i = threadIdx.x; i < size; i += blockDim.x) {
        shared_input[i] = input[i];
    }
    __syncthreads();
    output[tid] = 0;
    int start = tid - kernal_size / 2;
    int end = start + kernal_size;
    if (start < 0 || end > size) return;
    for (int i = start; i < end; i++) {
        output[tid] += shared_input[i] * kernal[i - start];
    }
}

int main() {
    int size = 1024;
    int kernal_size = 5;
    float* h_input = (float*)malloc(size * sizeof(float));
    float* h_output = (float*)malloc(size * sizeof(float));
    float* h_kernal = (float*)malloc(kernal_size * sizeof(float));
    for (int i = 0; i < size; i++) h_input[i] = 1.0f;
    for (int i = 0; i < kernal_size; i++) h_kernal[i] = 1.0f;

    conv1d_cpu(h_input, h_output, size, h_kernal, kernal_size);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (CPU)\n", i, h_output[i]);

    float* d_input = nullptr;
    float* d_output = nullptr;
    float* d_kernal = nullptr;
    cudaMalloc((void**)&d_input, size * sizeof(float));
    cudaMalloc((void**)&d_output, size * sizeof(float));
    cudaMalloc((void**)&d_kernal, kernal_size * sizeof(float));
    cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernal, h_kernal, kernal_size * sizeof(float), cudaMemcpyHostToDevice);

    int blocksize = 64;
    dim3 block(blocksize);
    dim3 grid((size + block.x - 1) / block.x);

    conv1d_gpu1<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (GPU 1)\n", i, h_output[i]);

    // copy the kernel into constant memory (offset 0, host-to-device)
    cudaMemcpyToSymbol(KERNAL, h_kernal, kernal_size * sizeof(float), 0, cudaMemcpyHostToDevice);
    conv1d_gpu2<<<grid, block>>>(d_input, d_output, size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (GPU 2)\n", i, h_output[i]);

    conv1d_gpu3<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (GPU 3)\n", i, h_output[i]);

    return 0;
}
```
convolution2d
```cpp
#include <stdio.h>
#include <cuda.h>

// one thread per output pixel; border pixels where the kernel does not fit stay 0
__global__ void conv2d_gpu(float *input, float *output, int size, float *kernal, int kernal_size) {
    int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
    int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = idx_y - kernal_size / 2;
    int end_row = start_row + kernal_size;
    int start_col = idx_x - kernal_size / 2;
    int end_col = start_col + kernal_size;
    output[idx_y * size + idx_x] = 0.0f;
    if (start_row < 0 || start_col < 0 || end_row > size || end_col > size) return;
    for (int row = start_row; row < end_row; row++) {
        for (int col = start_col; col < end_col; col++) {
            // atomicAdd is not strictly required here, since each thread writes
            // only its own output element; a plain accumulation would also work
            atomicAdd(&output[idx_y * size + idx_x],
                      input[row * size + col] * kernal[(row - start_row) * kernal_size + (col - start_col)]);
        }
    }
}

int main() {
    int size = 64, kernal_size = 5;
    int bytes = size * size * sizeof(float);
    int kernal_bytes = kernal_size * kernal_size * sizeof(float);
    float *h_input = (float*)malloc(bytes);
    float *h_output = (float*)malloc(bytes);
    float *h_kernal = (float*)malloc(kernal_bytes);
    for (int i = 0; i < size * size; i++) h_input[i] = 1.0f;
    for (int i = 0; i < kernal_size * kernal_size; i++) h_kernal[i] = 1.0f;

    float *d_input = nullptr;
    float *d_output = nullptr;
    float *d_kernal = nullptr;
    cudaMalloc((void**)&d_input, bytes);
    cudaMalloc((void**)&d_output, bytes);
    cudaMalloc((void**)&d_kernal, kernal_bytes);
    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernal, h_kernal, kernal_bytes, cudaMemcpyHostToDevice);

    int blocksize = 16;
    dim3 block(blocksize, blocksize, 1);
    dim3 grid((size + block.x - 1) / block.x, (size + block.y - 1) / block.y, 1);
    conv2d_gpu<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();

    cudaMemcpy(h_output, d_output, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            printf("%.1f, ", h_output[i * size + j]);
        }
        printf("\n");
    }
    return 0;
}
```