References

高性能计算方向面试问题总结 | zhihu

C++

Atomic operations

【C++11 多线程】原子操作(六)

Deep Learning Modules

How to understand overfitting

Think of the dataset as constraints and the model parameters as degrees of freedom. The fewer the constraints and the more the degrees of freedom, the harder it is to pin down one specific model. For example, suppose we have only two data points and want a linear model. Fitting with ax + b, a two-parameter model, is exactly determined; fitting with ax^2 + bx + c, a three-parameter model, is underdetermined and admits many solutions, with no guarantee of recovering the linear case a = 0. The remedies are, on one hand, to increase the amount and diversity of the data, and on the other, to constrain the model so that it is biased toward the intended form.
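
As a concrete instance of the example above (the numbers are chosen here only for illustration), take the two points (0, 1) and (1, 3):

y = ax + b:\quad b = 1,\; a + b = 3 \;\Rightarrow\; y = 2x + 1 \text{ (unique)}

y = ax^2 + bx + c:\quad c = 1,\; a + b + c = 3 \;\Rightarrow\; y = ax^2 + (2-a)x + 1 \text{ for every } a

The quadratic family fits both points perfectly for any value of a, and nothing in the data forces a = 0; that leftover degree of freedom is exactly what more data or regularization has to remove.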

On the number of parameter updates

During training, how many parameter updates are needed depends on the "distance" from the current parameters to a global/local optimum. For the same model, on a simple dataset the loss may keep decreasing after a very large number of updates because the global optimum has not yet been reached, whereas on a complex dataset the loss may flatten out after only a few updates because the optimization has fallen into a local optimum.

How to understand activation functions

Why do we need activation functions?

  1. To introduce non-linearity (e.g. the non-linear activation Sigmoid); without it, stacked linear layers collapse into a single linear map, as shown in the sketch after this list
  2. To emulate conditional branching, i.e. to realize an if-else effect with a function (e.g. the piecewise-linear ReLU, or the unit step function)
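
A minimal PyTorch sketch of point 1 (my own illustration, not from the original notes): two stacked Linear layers without an activation are exactly equivalent to one Linear layer, while inserting a ReLU breaks that equivalence.

import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 8)

# two linear layers with no activation in between
f1 = nn.Linear(8, 16, bias=False)
f2 = nn.Linear(16, 3, bias=False)
y_stacked = f2(f1(x))

# an equivalent single linear layer: W = W2 @ W1
merged = nn.Linear(8, 3, bias=False)
merged.weight.data = f2.weight @ f1.weight
y_merged = merged(x)

print(torch.allclose(y_stacked, y_merged, atol=1e-5))  # True: still one linear map

# with a ReLU in between, a single linear layer can no longer reproduce the mapping
y_relu = f2(torch.relu(f1(x)))
print(torch.allclose(y_relu, y_merged, atol=1e-5))      # False (in general)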

Batch Normalization
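
For reference, the code below implements the standard batch-norm transform

y = \gamma\cdot\frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta

where \mu_B and \sigma_B^2 are the per-feature (fully connected case) or per-channel (convolutional case) mean and variance of the current mini-batch during training; at inference time they are replaced by the running averages moving_mean and moving_var.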

import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to tell whether we are in training or inference mode
    if not torch.is_grad_enabled():
        # In inference mode, use the running (moving-average) mean and variance directly
        # moving_mean and moving_var are the global statistics
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # shape: [B, C]
            # Fully connected case: compute mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # shape: [B, C, H, W]
            # 2D convolution case: compute mean and variance per channel (axis=1).
            # Keep the shape of X so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch statistics
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the running mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer, or output channels of a conv layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters (trained via gradients), initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Non-parameter buffers, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the same device, copy moving_mean and moving_var
        # to the device where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y

if __name__ == "__main__":

    B, C, H, W = 8, 3, 224, 224
    input = torch.rand(B, C, H, W)
    batch_norm2d = BatchNorm(num_features=3, num_dims=4)
    output = batch_norm2d(input)
    print(output.shape)

Layer Normalization

For an input of shape (B, C, H*W), LayerNorm normalizes over the (C, H*W) dimensions, i.e. it uses mean(dim=(-2, -1)) and std(dim=(-2, -1)).
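
A small sketch of that statement (my own check, with the shapes chosen arbitrarily): normalizing by hand over the last two dimensions matches nn.LayerNorm with normalized_shape=(C, H*W).

import torch
from torch import nn

B, C, HW = 8, 3, 16
x = torch.randn(B, C, HW)

# manual LayerNorm over the (C, H*W) dimensions
mean = x.mean(dim=(-2, -1), keepdim=True)
var = x.var(dim=(-2, -1), keepdim=True, unbiased=False)
x_manual = (x - mean) / torch.sqrt(var + 1e-5)

# built-in LayerNorm over the same dims (affine params default to 1 and 0)
ln = nn.LayerNorm(normalized_shape=(C, HW), eps=1e-5)
x_builtin = ln(x)

print(torch.allclose(x_manual, x_builtin, atol=1e-5))  # True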

Dropout

Dropout is used only during training, not at inference time. It is a way of perturbing the input or intermediate activations: it randomly sets some elements to 0, which makes the model more robust.

Within one forward pass, dropout randomly zeroes some intermediate results; the weights connected to those zeroed elements receive no update in the corresponding backward pass (their gradient is 0), while the weights connected to the surviving elements get updated. The weights that actually carry the signal therefore receive relatively more training, which is one explanation of why dropout works.

import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p: float = 0.5) -> None:
        super().__init__()
        if p < 0 or p > 1:
            raise ValueError(f"dropout probability has to be between 0 and 1, but got {p}")
        self.p = p

    def forward(self, x):
        if self.training is False:
            return x
        if self.p == 1:
            return torch.zeros_like(x)
        if self.p == 0:
            return x
        mask = (torch.rand(x.shape) > self.p).float()
        '''why / (1.0 - self.p)? so that E[sum(dropout(x))] == sum(x)'''
        return mask * x / (1.0 - self.p)

if __name__ == "__main__":

    B, T, C = 1, 3, 4
    input = torch.arange(B*T*C).reshape(B, T, C)
    print(input)

    model = Dropout(p=0.5)

    model.train()
    print(f"self.training: {model.training}")
    output_training = model(input)
    print(output_training)

    model.eval()
    print(f"self.training: {model.training}")
    output_evaluate = model(input)
    print(output_evaluate)

'''
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
self.training: True
tensor([[[ 0.,  2.,  0.,  6.],
         [ 0.,  0., 12., 14.],
         [ 0.,  0., 20.,  0.]]])
self.training: False
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
'''

Softmax

Implement a Softmax, where Softmax is defined as

\text{softmax}([x_1,\cdots,x_n]) = \left[\frac{\exp(x_1)}{\sum_i^n\exp(x_i)}, \cdots, \frac{\exp(x_n)}{\sum_i^n\exp(x_i)}\right]

In the actual implementation, however, we subtract \max(x_1,\cdots,x_n) from [x_1,\cdots,x_n] before taking \exp. Because Softmax is shift-invariant, the result is the same:

\frac{\exp(x_i - \max)}{\sum_i^n\exp(x_i - \max)} = \frac{\exp(x_i)/\exp(\max)}{\sum_i^n\exp(x_i)/\exp(\max)} = \frac{\exp(x_i)}{\sum_i^n\exp(x_i)}

The benefit is overflow protection: for a large x, taking \exp(x) may overflow. Subtracting \max shifts all values of x into (-\infty, 0], so \exp(x) falls in (0, 1] and no overflow can occur.

import torch
from torch import nn

class Softmax(nn.Module):
    def __init__(self, dim=0):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        # subtract the per-row max for numerical stability
        x_max = torch.max(x, dim=self.dim, keepdim=True)[0]
        x_exp = torch.exp(x - x_max)
        # sum of the shifted exponentials (note: sum over x_exp, not x)
        x_exp_sum = torch.sum(x_exp, dim=self.dim, keepdim=True)
        output = x_exp / x_exp_sum
        return output

if __name__ == "__main__":
    B, T, C = 8, 192, 768
    input = torch.rand(B, T, C)
    output = Softmax(dim=-1)(input)
    print(output.shape)

Self-Attention

import torch
from torch import nn

'''
x.shape: [B, T, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
attention = softmax((Q @ K^T) / sqrt(dim_k)) @ V
'''
class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v):
        super(SelfAttention, self).__init__()
        assert dim_q == dim_k  # Q and K must share the same dimension
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)  # Wq
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)  # Wk
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)  # Wv
        self.norm = dim_k ** 0.5

    def forward(self, x):
        '''x: (B, T, dim_in)'''
        assert x.shape[-1] == self.dim_in
        q = self.linear_q(x)  # q = x @ Wq
        k = self.linear_k(x)  # k = x @ Wk
        v = self.linear_v(x)  # v = x @ Wv

        # Q:(B, T, dim_q) @ K^T:(B, dim_k, T) -> Q@K^T:(B, T, T)
        attention = torch.matmul(q, k.transpose(-2, -1)) / self.norm
        attention = nn.Softmax(dim=-1)(attention)
        # Q@K^T:(B, T, T) @ V:(B, T, dim_v) -> (Q@K^T)@V:(B, T, dim_v)
        attention = torch.matmul(attention, v)

        return attention

if __name__ == "__main__":
    # Batch_size, Time_step, Embedding_size
    B, T, C = 8, 12, 4
    input = torch.rand(B, T, C)
    # input:(B, T, C) -> Q:(B, T, dim_q), K:(B, T, dim_k), V:(B, T, dim_v)
    attention = SelfAttention(dim_in=C, dim_q=16, dim_k=16, dim_v=32)
    output = attention.forward(input)
    print(output.shape)

MultiHead-Attention

import torch
from torch import nn

'''
x.shape: [B, T, C]
Q = x @ Wq; K = x @ Wk; V = x @ Wv
Q.shape: [B, T, dim_q]; K.shape: [B, T, dim_k]; V.shape: [B, T, dim_v]
Q = [Q1,..,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
Q1-Qh: [B, T, dim_q/h]; K1-Kh: [B, T, dim_k/h]; V1-Vh: [B, T, dim_v/h]
concat[attention(Q1,K1,V1), ..., attention(Qh,Kh,Vh)]
attention(Q1,K1,V1)-attention(Qh,Kh,Vh): [B, T, dim_v/h]
'''
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        assert dim_q == dim_k  # dim_q == dim_k
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.num_heads = num_heads
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)  # Wq
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)  # Wk
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)  # Wv
        self.norm_fact = (dim_k // num_heads) ** 0.5

    def forward(self, x):
        '''x.shape: [B, T, dim_in]'''
        B, T, dim_in = x.shape
        assert dim_in == self.dim_in
        dim_q = self.dim_q
        dim_k = self.dim_k
        dim_v = self.dim_v
        heads = self.num_heads

        q = self.linear_q(x).reshape(B, T, heads, dim_q//heads).transpose(1, 2)  # (B, heads, T, dim_q//heads)
        k = self.linear_k(x).reshape(B, T, heads, dim_k//heads).transpose(1, 2)  # (B, heads, T, dim_k//heads)
        v = self.linear_v(x).reshape(B, T, heads, dim_v//heads).transpose(1, 2)  # (B, heads, T, dim_v//heads)

        # Q:[B, h, T, dim_q//h] @ K^T:[B, h, dim_k//h, T] -> Q@K^T:[B, h, T, T]
        attention = torch.matmul(q, k.transpose(-2, -1)) / self.norm_fact
        attention = nn.Softmax(dim=-1)(attention)
        # Q@K^T:[B, h, T, T] @ V:[B, h, T, dim_v//h] -> (Q@K^T)@V:[B, h, T, dim_v//h]
        attention = torch.matmul(attention, v)
        # concat: [B, h, T, dim_v//h] -> [B, T, h, dim_v//h] -> (B, T, dim_v)
        attention = attention.transpose(1, 2).reshape(B, T, dim_v)
        return attention

if __name__ == "__main__":
    # Batch_size, Time_step, Embedding_size
    B, T, C = 8, 12, 4
    input = torch.rand(B, T, C)
    # input:(B, T, C) -> Q:(B, T, dim_q), K:(B, T, dim_k), V:(B, T, dim_v)
    # Q = [Q1,..,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
    # Q1-Qh: [B, T, dim_q/h]; K1-Kh: [B, T, dim_k/h]; V1-Vh: [B, T, dim_v/h]
    multihead = MultiHeadSelfAttention(dim_in=C, dim_q=16, dim_k=16, dim_v=32, num_heads=8)
    output = multihead.forward(input)
    print(output.shape)

PatchEmbedding in ViT

Vision Transformer's Patch Embedding turns a CxHxW image into a sequence of [(H//P)*(W//P), C*P*P] "tokens", which can then be fed into MultiHeadSelfAttention as usual.

import torch
from torch import nn

class PatchEmbedding(nn.Module):
    '''
    2D image to token embedding
    '''
    def __init__(self, channel, height, width, patch_size):
        super().__init__()
        self.C = channel
        self.H = height
        self.W = width
        self.P = patch_size
        self.embed_dim = channel * patch_size * patch_size
        '''Out = ⌊(In - Kernel + 2*Padding) / Stride⌋ + 1'''
        '''[B, C, H, W] -project-> [B, C*P*P, H//P, W//P]'''
        self.proj = nn.Conv2d(in_channels=channel, out_channels=self.embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.H and W == self.W, \
            f"input shape {H}x{W} doesn't match model shape {self.H}x{self.W}"
        # [B, C, H, W] -project-> [B, C*P*P, H//P, W//P]
        x = self.proj(x)
        # [B, C*P*P, H//P, W//P] -flatten-> [B, C*P*P, (H//P)*(W//P)]
        x = x.flatten(start_dim=-2)
        # [B, C*P*P, (H//P)*(W//P)] -transpose-> [B, (H//P)*(W//P), C*P*P]
        x = x.transpose(-2, -1)
        return x

if __name__ == "__main__":
    # Batch_size, Channel, Height, Width, Patch_size
    B, C, H, W, P = 8, 3, 224, 224, 16
    input = torch.rand(B, C, H, W)
    '''
    (H//P)*(W//P): the number of patches, C*P*P: the embedding dimension of a patch
    [B, C, H, W] -> [B, (H//P)*(W//P), C*P*P]
    [8, 3, 224, 224] -> [8, 14*14, 3*16*16] = [8, 196, 768]
    '''
    embedding = PatchEmbedding(C, H, W, P)
    output = embedding(input)
    print(output.shape)

CUDA

reduction

#include <stdio.h>
#include <cuda.h>

int sum(const int* idata, int n) {
    int result = 0;
    for(int i=0; i<n; i++) {
        result += idata[i];
    }
    return result;
}

__global__ void reduction1(int *idata, int *odata, int n) {
    // thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // boundary check
    if(tid >= n) return;

    /*
    idx: 0, 1, 2, 3, 4, 5, 6, 7,
    tid: 0  0, 2  2, 4  4, 6  6,
    tid: 0     0,    4     4,
    tid: 0           0,
    */
    for(int stride = 1; tid+stride < n; stride *= 2) {
        // stride = 1, 2, 4, 8, ...
        if((tid % (2 * stride)) == 0) {
            idata[tid] = idata[tid] + idata[tid + stride];
        }
        // synchronize
        __syncthreads();
    }
    // get the result
    if(tid==0) *odata = idata[0];
}

__global__ void reduction2(int *idata, int *odata, int n) {
    // thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // boundary check
    if(tid >= n) return;

    /*
    idx: 0, 1, 2, 3, 4, 5, 6, 7,
    tid: 0  0, 1  1, 2  2, 3  3,
    tid: 0     0,    1     1,
    tid: 0           0,
    */
    for(int stride=1; stride<n; stride*=2) {
        int index = 2 * stride * tid;
        if(index + stride < n)
            idata[index] += idata[index + stride];
        // synchronize threads
        __syncthreads();
    }
    // get the result
    if(tid == 0) *odata = idata[0];
}

__global__ void reduction3(int *idata, int *odata, int n) {
    // thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // boundary check
    if(tid >= n) return;

    /*
    idx: 0, 1, 2, 3, 4, 5, 6, 7,
    tid: 0, 1, 2, 3, 0, 1, 2, 3
    tid: 0, 1, 0, 1,
    tid: 0, 0
    */

    for(int stride=n/2; stride>=1; stride/=2) {
        if(tid < stride) {
            // idata[tid] += idata[tid + stride];
            atomicAdd(&idata[tid], idata[tid + stride]);
        }
        // synchronize threads
        __syncthreads();
    }
    // get the result
    if(tid == 0) *odata = idata[0];
}

int main() {
    int size = 1<<10;               // 2^10 = 1024: total number of elements
    int bytes = size * sizeof(int); // bytes for the int array

    int* h_idata = (int*)malloc(bytes);       // input: an array
    int* h_odata = (int*)malloc(sizeof(int)); // output: a scalar
    // initialize h_idata
    for(int i=0; i<size; i++) {
        // h_idata[i] = (int)(rand() % 256);
        h_idata[i] = 1;
    }

    int* d_idata = nullptr; // input: an array
    int* d_odata = nullptr; // output: a scalar
    cudaMalloc((void**)&d_idata, bytes);
    cudaMalloc((void**)&d_odata, sizeof(int));
    // copy data to device from host
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    int blocksize = 64;
    dim3 block(blocksize);
    dim3 grid((size + block.x - 1) / block.x); // round up to an integer

    // <<<16, 64>>>
    // reduction1<<<grid, block>>>(d_idata, d_odata, size);
    // reduction2<<<grid, block>>>(d_idata, d_odata, size);
    reduction3<<<grid, block>>>(d_idata, d_odata, size);
    // synchronize host and device
    cudaDeviceSynchronize();
    cudaMemcpy(h_odata, d_odata, sizeof(int), cudaMemcpyDeviceToHost);

    printf("output from gpu: %d\n", *h_odata);
    printf("output from cpu: %d\n", sum(h_idata, size));

    return 0;
}
~/$ nvcc ./reduction.cu -o reduction
~/$ ./reduction

Note: reduction3 is buggy; it nondeterministically prints 960, 992, or 1024. The cause is that __syncthreads() only synchronizes the threads of a single block, while reduction3 reads elements of idata that other blocks are still updating (e.g. at stride = 256, blocks 0-3 read idata[256..511], which blocks 4-7 wrote in the previous iteration), so the cross-block part of the reduction is a data race. reduction1 and reduction2 also read across block boundaries without a grid-wide barrier, so they are not guaranteed correct either. A grid-wide sum needs either a single block, per-block partial sums combined with atomics, or a second kernel launch; a sketch of the atomic variant follows.
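
A minimal corrected sketch (my addition, not from the original post): each block reduces its own tile in shared memory, where __syncthreads() is sufficient, and only the per-block partial sums are combined with atomicAdd, so no cross-block reads are needed. It assumes *odata is zero-initialized (e.g. with cudaMemset) and is launched as reduction_fixed<<<grid, block, blocksize * sizeof(int)>>>(d_idata, d_odata, size).

__global__ void reduction_fixed(const int *idata, int *odata, int n) {
    extern __shared__ int sdata[];        // blockDim.x ints, size given at launch
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // load one element per thread (0 if out of range)
    sdata[tid] = (gid < n) ? idata[gid] : 0;
    __syncthreads();

    // tree reduction within the block; __syncthreads() is valid here
    // because all participating threads belong to the same block
    for(int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        if(tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    // one atomic per block combines the partial sums across blocks
    if(tid == 0) atomicAdd(odata, sdata[0]);
}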

convolution

CUDA笔记-卷积计算
CUDA 中的卷积运算

#include <stdio.h>
#include <cuda.h>

// convolution on cpu
void conv1d_cpu(float* input, float* output, int size, float* kernal, int kernal_size) {
    // stride = 1, padding = kernal_size/2
    for(int i=0; i<size; i++) {
        output[i] = 0;
        int start = i - kernal_size/2;
        int end = start + kernal_size;
        if(start < 0 || end > size) continue;
        for(int j=start; j<end; j++) {
            output[i] += input[j] * kernal[j-start];
        }
    }
}

// global memory version
__global__ void conv1d_gpu1(float *input, float *output, int size, float* kernal, int kernal_size) {
    // stride = 1, zero_padding = kernal_size/2
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // boundary check
    if(tid >= size) return;
    // tid -> output[tid]
    output[tid] = 0;
    int start = tid - kernal_size/2;
    int end = start + kernal_size;
    if(start < 0 || end > size) return;
    for(int i=start; i<end; i++) {
        output[tid] += input[i] * kernal[i-start];
    }
}

// constant memory version: the kernel weights live in __constant__ memory
#define KERNAL_SIZE 5
__constant__ float KERNAL[KERNAL_SIZE];

__global__ void conv1d_gpu2(float *input, float *output, int size) {
    // thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // boundary check
    if(tid >= size) return;
    // tid -> output[tid]
    output[tid] = 0;
    int start = tid - KERNAL_SIZE/2;
    int end = start + KERNAL_SIZE;
    if(start < 0 || end > size) return;
    for(int i=start; i<end; i++) {
        output[tid] += input[i] * KERNAL[i - start];
    }
}

// shared memory version
#define SIZE 1024
__global__ void conv1d_gpu3(float *input, float *output, int size, float *kernal, int kernal_size) {
    // thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // shared memory (the array size must be a compile-time constant)
    __shared__ float shared_input[SIZE];
    // each block cooperatively copies the whole input into its own shared buffer,
    // so reads near block boundaries are also covered
    for(int i = threadIdx.x; i < size; i += blockDim.x) {
        shared_input[i] = input[i];
    }
    __syncthreads();

    // boundary check
    if(tid >= size) return;
    output[tid] = 0;
    int start = tid - kernal_size/2;
    int end = start + kernal_size;
    if(start < 0 || end > size) return;
    for(int i=start; i<end; i++) {
        output[tid] += shared_input[i] * kernal[i-start];
    }
}

int main() {

    /* CPU */
    int size = 1024;
    int kernal_size = 5;
    float* h_input = (float*)malloc(size * sizeof(float));
    float* h_output = (float*)malloc(size * sizeof(float));
    float* h_kernal = (float*)malloc(kernal_size * sizeof(float));

    for(int i=0; i<size; i++) h_input[i] = 1.0f;
    for(int i=0; i<kernal_size; i++) h_kernal[i] = 1.0f;

    conv1d_cpu(h_input, h_output, size, h_kernal, kernal_size);

    for(int i=0; i<10; i++) printf("output[%d] = %f (CPU)\n", i, h_output[i]);

    /* GPU: global memory */
    float* d_input = nullptr;
    float* d_output = nullptr;
    float* d_kernal = nullptr;
    cudaMalloc((void**)&d_input, size*sizeof(float));
    cudaMalloc((void**)&d_output, size*sizeof(float));
    cudaMalloc((void**)&d_kernal, kernal_size*sizeof(float));

    cudaMemcpy(d_input, h_input, size*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernal, h_kernal, kernal_size*sizeof(float), cudaMemcpyHostToDevice);

    int blocksize = 64;
    dim3 block(blocksize);
    dim3 grid((size + block.x - 1) / block.x); // round up to an integer

    conv1d_gpu1<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size*sizeof(float), cudaMemcpyDeviceToHost);
    for(int i=0; i<10; i++) printf("output[%d] = %f (GPU 1)\n", i, h_output[i]);

    /* GPU: constant memory */
    // note: the 4th argument of cudaMemcpyToSymbol is the byte offset, not the copy kind
    cudaMemcpyToSymbol(KERNAL, h_kernal, kernal_size*sizeof(float), 0, cudaMemcpyHostToDevice);
    conv1d_gpu2<<<grid, block>>>(d_input, d_output, size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size*sizeof(float), cudaMemcpyDeviceToHost);
    for(int i=0; i<10; i++) printf("output[%d] = %f (GPU 2)\n", i, h_output[i]);

    /* GPU: shared memory */
    conv1d_gpu3<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size*sizeof(float), cudaMemcpyDeviceToHost);
    for(int i=0; i<10; i++) printf("output[%d] = %f (GPU 3)\n", i, h_output[i]);

    return 0;
}

convolution2d

#include <stdio.h>
#include <cuda.h>

// square input and square kernal
__global__ void conv2d_gpu(float *input, float *output, int size, float *kernal, int kernal_size) {
    // stride = 1, zero_padding = kernal_size/2
    // thread index: idx_x -> col, idx_y -> row
    int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
    int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = idx_y - kernal_size/2;
    int end_row = start_row + kernal_size;
    int start_col = idx_x - kernal_size/2;
    int end_col = start_col + kernal_size;
    // initialize
    output[idx_y*size + idx_x] = 0.0f;
    // boundary check
    if(start_row < 0 || start_col < 0 || end_row > size || end_col > size) return;
    for(int row=start_row; row<end_row; row++) {
        for(int col=start_col; col<end_col; col++) {
            // output[idx_y*size + idx_x] += input[row * size + col] * kernal[(row-start_row)*kernal_size + (col-start_col)];
            atomicAdd(&output[idx_y*size + idx_x], input[row * size + col] * kernal[(row-start_row)*kernal_size + (col-start_col)]);
        }
    }
}

int main(){
    int size = 64, kernal_size = 5;
    int bytes = size * size * sizeof(float), kernal_bytes = kernal_size * kernal_size * sizeof(float);

    float *h_input = (float*)malloc(bytes);
    float *h_output = (float*)malloc(bytes);
    float *h_kernal = (float*)malloc(kernal_bytes);

    // initialize input and kernal
    for(int i = 0; i < size*size; i++) h_input[i] = 1.0f;
    for(int i = 0; i < kernal_size*kernal_size; i++) h_kernal[i] = 1.0f;

    // device
    float *d_input = nullptr;
    float *d_output = nullptr;
    float *d_kernal = nullptr;
    cudaMalloc((void**)&d_input, bytes);
    cudaMalloc((void**)&d_output, bytes);
    cudaMalloc((void**)&d_kernal, kernal_bytes);

    // transfer values from host to device
    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernal, h_kernal, kernal_bytes, cudaMemcpyHostToDevice);

    int blocksize = 16;
    dim3 block(blocksize, blocksize, 1);
    dim3 grid((size + block.x - 1)/block.x, (size + block.y - 1)/block.y, 1);
    conv2d_gpu<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();

    // transfer output results from device to host
    cudaMemcpy(h_output, d_output, bytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    for(int i = 0; i<10; i++) {
        for(int j = 0; j<10; j++) {
            printf("%.1f, ", h_output[i*size + j]);
        }
        printf("\n");
    }

    return 0;
}