References
高性能计算方向面试问题总结 | zhihu
C++
Atomic Operations
【C++11 多线程】原子操作(六)
Deep Learning Modules
How to Understand Overfitting
Think of the dataset as constraints and the model parameters as degrees of freedom: the fewer the constraints and the more the degrees of freedom, the harder it is to pin down one specific model. For example, suppose we only have two data points and want a linear model. Fitting them with the two-parameter model ax + b is exactly determined; fitting them with the three-parameter model ax^2 + bx + c admits many solutions, and there is no guarantee we recover the linear case a = 0. The remedies are, on one hand, more and more diverse data, and on the other hand, constraining (regularizing) the model so that it is biased toward the intended, simpler model.
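A minimal sketch of the two-point example above (the specific numbers and the particular parabola are made up for illustration): a line and a genuinely non-linear parabola both pass through the same two points, so two points alone cannot pin down a three-parameter model.

```python
import torch

# Two data points generated by the "ideal" line y = 2x + 1
xs = torch.tensor([0.0, 1.0])
ys = 2 * xs + 1                              # tensor([1., 3.])

line = lambda x: 2 * x + 1                   # ax + b with a=2, b=1
parabola = lambda x: 3 * x ** 2 - 1 * x + 1  # ax^2 + bx + c with a != 0

# Both models fit the two points exactly, so the data cannot tell them apart.
print(torch.allclose(line(xs), ys), torch.allclose(parabola(xs), ys))  # True True
```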
On the Number of Parameter Updates
During training, the number of parameter updates a model needs depends on the "distance" between the current parameters and the global/local optimum. For the same model, on a simple dataset the loss may well keep decreasing after a very large number of updates, because the global optimum has not yet been reached; on a complex dataset the loss may plateau after only a few updates, because training has fallen into a local optimum.
How to Understand Activation Functions
Why do we need activation functions?
To introduce non-linearity (e.g. the non-linear activation Sigmoid); without it, stacked linear layers collapse into a single linear map.
To express conditional branching as a function, i.e. an if-else effect (e.g. the piecewise-linear ReLU, or the unit step function). A small sketch of both points follows this list.
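A quick illustration of both considerations (the tensors and shapes below are arbitrary examples, not from the original notes): without an activation, two Linear layers collapse into one linear map, and ReLU is exactly an element-wise if-else.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 3)

# 1) Non-linearity: Linear -> Linear with no activation is still a single linear map
W1, W2 = torch.randn(4, 3), torch.randn(2, 4)
stacked = x @ W1.T @ W2.T
collapsed = x @ (W2 @ W1).T
print(torch.allclose(stacked, collapsed, atol=1e-5))  # True

# 2) Branching: ReLU(z) is the element-wise "if z > 0 take z, else take 0"
z = torch.linspace(-3, 3, 7)
print(torch.equal(nn.ReLU()(z), torch.where(z > 0, z, torch.zeros_like(z))))  # True
```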
Batch Normalization
```python
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Inference: normalize with the running (moving) statistics
    if not torch.is_grad_enabled():
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    # Training: normalize with the statistics of the current mini-batch
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # fully-connected layer: statistics over the batch dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # convolutional layer: per-channel statistics over (B, H, W)
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # update the running statistics with exponential moving averages
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # learnable scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            eps=1e-5, momentum=0.9)
        return Y

if __name__ == "__main__":
    B, C, H, W = 8, 3, 224, 224
    input = torch.rand(B, C, H, W)
    batch_norm2d = BatchNorm(num_features=3, num_dims=4)
    output = batch_norm2d(input)
    print(output.shape)  # torch.Size([8, 3, 224, 224])
```
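A quick sanity check (assuming the default affine initialization gamma=1, beta=0 and eps=1e-5 on both sides): in training mode the BatchNorm above should match nn.BatchNorm2d, since both normalize with the statistics of the current batch.

```python
import torch
from torch import nn

x = torch.rand(8, 3, 16, 16)
custom = BatchNorm(num_features=3, num_dims=4)   # the class defined above
builtin = nn.BatchNorm2d(3, eps=1e-5)
custom.train(); builtin.train()
print(torch.allclose(custom(x), builtin(x), atol=1e-5))  # expected: True
```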
Layer Normalization
For an input of shape (B, C, H*W), LN normalizes over the (C, H*W) dimensions, i.e. mean(dim=(-2,-1)) and std(dim=(-2,-1)).
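A minimal sketch of that statement (the shapes and eps=1e-5 below are assumptions matching PyTorch defaults), comparing a manual normalization over the last two dimensions with nn.LayerNorm:

```python
import torch
from torch import nn

B, C, HW = 8, 3, 224 * 224
x = torch.rand(B, C, HW)

# manual LN over (C, H*W): per-sample mean/variance across the last two dims
mean = x.mean(dim=(-2, -1), keepdim=True)
var = x.var(dim=(-2, -1), keepdim=True, unbiased=False)
y_manual = (x - mean) / torch.sqrt(var + 1e-5)

# nn.LayerNorm with normalized_shape covering the last two dims (no affine params)
y_builtin = nn.LayerNorm([C, HW], elementwise_affine=False)(x)
print(torch.allclose(y_manual, y_builtin, atol=1e-5))  # expected: True
```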
Dropout
Dropout is only applied during training and is disabled at inference time. It is a way of perturbing the input or intermediate activations: some elements are randomly zeroed out, which makes the model more robust.
In one forward pass, dropout randomly zeroes some intermediate activations; the weights attached to those zeroed elements receive zero gradient in that backward pass and are not updated, while the weights attached to the surviving elements get relatively more training. Letting the weights that actually do the work receive more of the training signal is one explanation of why dropout helps.
```python
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p: float = 0.5) -> None:
        super().__init__()
        if p < 0 or p > 1:
            raise ValueError(f"dropout probability has to be between 0 and 1, but got {p}")
        self.p = p

    def forward(self, x):
        # dropout is a no-op at inference time
        if self.training is False:
            return x
        if self.p == 1:
            return torch.zeros_like(x)
        if self.p == 0:
            return x
        # each element survives with probability 1 - p
        mask = (torch.rand(x.shape) > self.p).float()
        # why divide by (1 - p)? to keep the expectation unchanged,
        # i.e. E[dropout(x)] == x, so the activation scale matches inference
        return mask * x / (1.0 - self.p)

if __name__ == "__main__":
    B, T, C = 1, 3, 4
    input = torch.arange(B * T * C).reshape(B, T, C)
    print(input)

    model = Dropout(p=0.5)
    model.train()
    print(f"self.training: {model.training}")
    output_training = model(input)
    print(output_training)

    model.eval()
    print(f"self.training: {model.training}")
    output_evaluate = model(input)
    print(output_evaluate)

''' Example output:
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
self.training: True
tensor([[[ 0.,  2.,  0.,  6.],
         [ 0.,  0., 12., 14.],
         [ 0.,  0., 20.,  0.]]])
self.training: False
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
'''
```
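To see the gradient claim from the paragraph above in action, here is a toy check using the Dropout class just defined (the tensor values are arbitrary): dropped positions receive exactly zero gradient in this pass, while surviving positions receive the scaled gradient 1/(1-p).

```python
import torch

x = torch.ones(1, 3, 4, requires_grad=True)
drop = Dropout(p=0.5)   # the class defined above
drop.train()
y = drop(x)
y.sum().backward()
print(y)       # zeros at the dropped positions
print(x.grad)  # 0 at the dropped positions, 1 / (1 - p) = 2 elsewhere
```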
Softmax
Implement a Softmax, defined as
$$
\text{softmax}([x_1,\cdots,x_n]) = \left[\frac{\exp(x_1)}{\sum_i^n\exp(x_i)}, \cdots, \frac{\exp(x_n)}{\sum_i^n\exp(x_i)}\right]
$$
In the actual implementation, however, we subtract $\max(x_1,\cdots,x_n)$ from $[x_1,\cdots,x_n]$ before taking $\exp$. Because Softmax is shift-invariant, the result is unchanged:
$$
\frac{\exp(x_i - \max)}{\sum_i^n\exp(x_i - \max)} =
\frac{\exp(x_i)/\exp(\max)}{\sum_i^n\exp(x_i)/\exp(\max)} =
\frac{\exp(x_i)}{\sum_i^n\exp(x_i)}
$$
The benefit is overflow prevention. For large $x$, $\exp(x)$ can overflow. Subtracting the $\max$ shifts every $x$ into $(-\infty, 0]$, so $\exp(x)$ lies in $(0, 1]$ and no overflow occurs.
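A tiny demonstration of the overflow issue (the input values are made up for illustration):

```python
import torch

x = torch.tensor([1000.0, 1000.1])
print(torch.exp(x))             # tensor([inf, inf]) -> a naive softmax would return nan
print(torch.exp(x - x.max()))   # tensor([0.9048, 1.0000]) stays finite
print(torch.softmax(x, dim=0))  # tensor([0.4750, 0.5250]) the built-in is numerically stable
```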
```python
import torch
from torch import nn

class Softmax(nn.Module):
    def __init__(self, dim=0):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        # subtract the per-slice max for numerical stability (shift invariance)
        x_max = torch.max(x, dim=self.dim, keepdim=True)[0]
        x_exp = torch.exp(x - x_max)
        # normalize by the sum of the exponentials along the same dim
        x_exp_sum = torch.sum(x_exp, dim=self.dim, keepdim=True)
        output = x_exp / x_exp_sum
        return output

if __name__ == "__main__":
    B, T, C = 8, 192, 768
    input = torch.rand(B, T, C)
    output = Softmax(dim=-1)(input)
    print(output.shape)  # torch.Size([8, 192, 768])
```
Self-Attention
```python
import torch
from torch import nn

'''
x.shape: [n, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
attention = softmax((Q @ K^T) / sqrt(dim_k)) @ V
'''

class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v):
        super(SelfAttention, self).__init__()
        assert dim_k == dim_q
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)
        self.norm = dim_k ** (1 / 2)  # scaling factor sqrt(dim_k)

    def forward(self, x):
        '''x: [..., n, dim_in]'''
        assert x.shape[-1] == self.dim_in
        q = self.linear_q(x)  # [..., n, dim_q]
        k = self.linear_k(x)  # [..., n, dim_k]
        v = self.linear_v(x)  # [..., n, dim_v]
        attention = torch.matmul(q, k.transpose(-2, -1)) / self.norm  # [..., n, n]
        attention = nn.Softmax(dim=-1)(attention)
        attention = torch.matmul(attention, v)  # [..., n, dim_v]
        return attention

if __name__ == "__main__":
    B, T, C = 8, 12, 4
    input = torch.rand(B, T, C)
    attention = SelfAttention(dim_in=C, dim_q=16, dim_k=16, dim_v=32)
    output = attention(input)
    print(output.shape)  # torch.Size([8, 12, 32])
```
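A quick cross-check sketch (it assumes PyTorch 2.0+, which provides F.scaled_dot_product_attention computing the same softmax(QK^T/sqrt(d))V with the default scale 1/sqrt(dim_q)): feeding the built-in the same projections should reproduce the module above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 12, 4)
att = SelfAttention(dim_in=4, dim_q=16, dim_k=16, dim_v=32)  # the class above
q, k, v = att.linear_q(x), att.linear_k(x), att.linear_v(x)
ref = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(att(x), ref, atol=1e-5))  # expected: True
```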
MultiHead-Attention
```python
import torch
from torch import nn

'''
x.shape: [B, T, C]
Q = x @ Wq; K = x @ Wk; V = x @ Wv
Q.shape: [B, T, dim_q]; K.shape: [B, T, dim_k]; V.shape: [B, T, dim_v]
Q = [Q1,...,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
Q1-Qh: [B, T, dim_q/h]; K1-Kh: [B, T, dim_k/h]; V1-Vh: [B, T, dim_v/h]
output = concat[attention(Q1,K1,V1), ..., attention(Qh,Kh,Vh)]
attention(Q1,K1,V1)-attention(Qh,Kh,Vh): [B, T, dim_v/h]
'''

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        assert dim_q == dim_k
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.num_heads = num_heads
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)
        self.norm_fact = (dim_k // num_heads) ** (1 / 2)  # sqrt of the per-head dim

    def forward(self, x):
        '''x.shape: [B, T, dim_in]'''
        B, T, dim_in = x.shape
        assert dim_in == self.dim_in
        dim_q, dim_k, dim_v = self.dim_q, self.dim_k, self.dim_v
        heads = self.num_heads

        # project, then split into heads: [B, T, dim] -> [B, heads, T, dim/heads]
        q = self.linear_q(x).reshape(B, T, heads, dim_q // heads).transpose(1, 2)
        k = self.linear_k(x).reshape(B, T, heads, dim_k // heads).transpose(1, 2)
        v = self.linear_v(x).reshape(B, T, heads, dim_v // heads).transpose(1, 2)

        attention = torch.matmul(q, k.transpose(-2, -1)) / self.norm_fact  # [B, heads, T, T]
        attention = nn.Softmax(dim=-1)(attention)
        attention = torch.matmul(attention, v)  # [B, heads, T, dim_v/heads]
        # merge the heads back: [B, heads, T, dim_v/heads] -> [B, T, dim_v]
        attention = attention.transpose(1, 2).reshape(B, T, dim_v)
        return attention

if __name__ == "__main__":
    B, T, C = 8, 12, 4
    input = torch.rand(B, T, C)
    multihead = MultiHeadSelfAttention(dim_in=C, dim_q=16, dim_k=16, dim_v=32, num_heads=8)
    output = multihead(input)
    print(output.shape)  # torch.Size([8, 12, 32])
```
PatchEmbedding in ViT
Vision Transformer uses the Patch Embedding operation to turn a CxHxW image into a sequence of "tokens" of shape [(H//P)*(W//P), C*P*P], which can then be fed into MultiHeadSelfAttention as usual.
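The shape bookkeeping becomes clearer with a small sketch (the Conv2d below is a stand-in projection with bias disabled, not the module from the implementation that follows): a stride-P, kernel-P convolution is exactly "cut the image into P x P patches and apply one shared Linear layer to each flattened patch".

```python
import torch
from torch import nn

B, C, H, W, P = 2, 3, 8, 8, 4
x = torch.rand(B, C, H, W)
proj = nn.Conv2d(C, C * P * P, kernel_size=P, stride=P, bias=False)

# convolution view: [B, C, H, W] -> [B, C*P*P, H//P, W//P] -> [B, N, C*P*P]
out_conv = proj(x).flatten(2).transpose(1, 2)

# patch view: cut into patches, flatten each, apply the same weights as a Linear layer
patches = x.unfold(2, P, P).unfold(3, P, P)                           # [B, C, H//P, W//P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)  # [B, N, C*P*P]
out_linear = patches @ proj.weight.flatten(1).T

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # expected: True
```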
```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    '''2D image to token embedding'''
    def __init__(self, channel, height, width, patch_size):
        super().__init__()
        self.C = channel
        self.H = height
        self.W = width
        self.P = patch_size
        self.embed_dim = channel * patch_size * patch_size
        '''Out = ⌊(In - Kernel + 2*Padding) / Stride⌋ + 1'''
        '''[B, C, H, W] -project-> [B, C*P*P, H//P, W//P]'''
        self.proj = nn.Conv2d(in_channels=channel, out_channels=self.embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.H and W == self.W, \
            f"input shape {H}x{W} doesn't match model shape {self.H}x{self.W}"
        x = self.proj(x)             # [B, C*P*P, H//P, W//P]
        x = x.flatten(start_dim=-2)  # [B, C*P*P, (H//P)*(W//P)]
        x = x.transpose(-2, -1)      # [B, (H//P)*(W//P), C*P*P]
        return x

if __name__ == "__main__":
    B, C, H, W, P = 8, 3, 224, 224, 16
    input = torch.rand(B, C, H, W)
    '''
    (H//P)*(W//P): the number of patches, C*P*P: the embedding dimension of a patch
    [B, C, H, W] -> [B, (H//P)*(W//P), C*P*P]
    [8, 3, 224, 224] -> [8, 14*14, 3*16*16] = [8, 196, 768]
    '''
    embedding = PatchEmbedding(C, H, W, P)
    output = embedding(input)
    print(output.shape)  # torch.Size([8, 196, 768])
```
CUDA
reduction
```cpp
#include <stdio.h>
#include <cuda.h>

// CPU reference implementation
int sum(const int* idata, int n) {
    int result = 0;
    for (int i = 0; i < n; i++) {
        result += idata[i];
    }
    return result;
}

// pairwise tree reduction in place: thread tid adds idata[tid + stride]
// into idata[tid] when tid is a multiple of 2 * stride
__global__ void reduction1(int *idata, int *odata, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int stride = 1; tid + stride < n; stride *= 2) {
        if ((tid % (2 * stride)) == 0) {
            idata[tid] = idata[tid] + idata[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) *odata = idata[0];
}

// same tree shape, re-indexed so that the active threads are contiguous
__global__ void reduction2(int *idata, int *odata, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int stride = 1; stride < n; stride *= 2) {
        int index = 2 * stride * tid;
        if (index + stride < n)
            idata[index] += idata[index + stride];
        __syncthreads();
    }
    if (tid == 0) *odata = idata[0];
}

// sequential addressing: the lower half of the active range adds the upper half,
// and the stride halves each iteration
__global__ void reduction3(int *idata, int *odata, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        if (tid < stride) {
            atomicAdd(&idata[tid], idata[tid + stride]);
        }
        __syncthreads();
    }
    if (tid == 0) *odata = idata[0];
}

int main() {
    int size = 1 << 10;
    int bytes = size * sizeof(int);
    int* h_idata = (int*)malloc(bytes);
    int* h_odata = (int*)malloc(sizeof(int));
    for (int i = 0; i < size; i++) {
        h_idata[i] = 1;
    }

    int* d_idata = nullptr;
    int* d_odata = nullptr;
    cudaMalloc((void**)&d_idata, bytes);
    cudaMalloc((void**)&d_odata, sizeof(int));
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    int blocksize = 64;
    dim3 block(blocksize);
    dim3 grid((size + block.x - 1) / block.x);
    reduction3<<<grid, block>>>(d_idata, d_odata, size);
    cudaDeviceSynchronize();

    cudaMemcpy(h_odata, d_odata, sizeof(int), cudaMemcpyDeviceToHost);
    printf("output from gpu: %d\n", *h_odata);
    printf("output from cpu: %d\n", sum(h_idata, size));
    return 0;
}
```
```bash
~/$ nvcc ./reduction.cu -o reduction
~/$ ./reduction
```
reduction3 is buggy: across runs it prints 960, 992, or 1024. The reason is that __syncthreads() only synchronizes the threads within a single block, while the stride loop reads idata[tid + stride] values written by threads of other blocks, so the result depends on how the blocks happen to be scheduled. (reduction1 and reduction2 rely on the same cross-block reads once the stride exceeds the block size, so strictly speaking they are not guaranteed correct either; a robust fix is to reduce within each block in shared memory and then combine the per-block partial sums, e.g. with an atomicAdd on the output.)
convolution
CUDA笔记-卷积计算
CUDA 中的卷积运算
```cpp
#include <stdio.h>
#include <cuda.h>

// CPU reference: 1D convolution; border elements where the kernel does not fit stay 0
void conv1d_cpu(float* input, float* output, int size, float* kernal, int kernal_size) {
    for (int i = 0; i < size; i++) {
        output[i] = 0;
        int start = i - kernal_size / 2;
        int end = start + kernal_size;
        if (start < 0 || end > size) continue;
        for (int j = start; j < end; j++) {
            output[i] += input[j] * kernal[j - start];
        }
    }
}

// version 1: input and kernel both read from global memory
__global__ void conv1d_gpu1(float *input, float *output, int size, float *kernal, int kernal_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = 0;
    int start = tid - kernal_size / 2;
    int end = start + kernal_size;
    if (start < 0 || end > size) return;
    for (int i = start; i < end; i++) {
        output[tid] += input[i] * kernal[i - start];
    }
}

// version 2: kernel placed in constant memory (read-only, broadcast to all threads)
#define KERNAL_SIZE 5
__constant__ float KERNAL[KERNAL_SIZE];

__global__ void conv1d_gpu2(float *input, float *output, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = 0;
    int start = tid - KERNAL_SIZE / 2;
    int end = start + KERNAL_SIZE;
    if (start < 0 || end > size) return;
    for (int i = start; i < end; i++) {
        output[tid] += input[i] * KERNAL[i - start];
    }
}

// version 3: input staged in shared memory. For simplicity every block copies the
// whole input (SIZE elements) and synchronizes before reading; a production kernel
// would stage only its own tile plus a halo of kernal_size/2 elements on each side.
#define SIZE 1024
__global__ void conv1d_gpu3(float *input, float *output, int size, float *kernal, int kernal_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float shared_input[SIZE];
    for (int i = threadIdx.x; i < size; i += blockDim.x) {
        shared_input[i] = input[i];
    }
    __syncthreads();
    output[tid] = 0;
    int start = tid - kernal_size / 2;
    int end = start + kernal_size;
    if (start < 0 || end > size) return;
    for (int i = start; i < end; i++) {
        output[tid] += shared_input[i] * kernal[i - start];
    }
}

int main() {
    int size = 1024;
    int kernal_size = 5;
    float* h_input = (float*)malloc(size * sizeof(float));
    float* h_output = (float*)malloc(size * sizeof(float));
    float* h_kernal = (float*)malloc(kernal_size * sizeof(float));
    for (int i = 0; i < size; i++) h_input[i] = 1.0f;
    for (int i = 0; i < kernal_size; i++) h_kernal[i] = 1.0f;

    conv1d_cpu(h_input, h_output, size, h_kernal, kernal_size);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (CPU)\n", i, h_output[i]);

    float* d_input = nullptr;
    float* d_output = nullptr;
    float* d_kernal = nullptr;
    cudaMalloc((void**)&d_input, size * sizeof(float));
    cudaMalloc((void**)&d_output, size * sizeof(float));
    cudaMalloc((void**)&d_kernal, kernal_size * sizeof(float));
    cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernal, h_kernal, kernal_size * sizeof(float), cudaMemcpyHostToDevice);

    int blocksize = 64;
    dim3 block(blocksize);
    dim3 grid((size + block.x - 1) / block.x);

    conv1d_gpu1<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (GPU 1)\n", i, h_output[i]);

    // copy the kernel into constant memory (offset 0, host-to-device)
    cudaMemcpyToSymbol(KERNAL, h_kernal, kernal_size * sizeof(float), 0, cudaMemcpyHostToDevice);
    conv1d_gpu2<<<grid, block>>>(d_input, d_output, size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (GPU 2)\n", i, h_output[i]);

    conv1d_gpu3<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("output[%d] = %f (GPU 3)\n", i, h_output[i]);

    return 0;
}
```
convolution2d
```cpp
#include <stdio.h>
#include <cuda.h>

// one thread per output pixel; border pixels where the kernel does not fit stay 0
__global__ void conv2d_gpu(float *input, float *output, int size, float *kernal, int kernal_size) {
    int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
    int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = idx_y - kernal_size / 2;
    int end_row = start_row + kernal_size;
    int start_col = idx_x - kernal_size / 2;
    int end_col = start_col + kernal_size;
    output[idx_y * size + idx_x] = 0.0f;
    if (start_row < 0 || start_col < 0 || end_row > size || end_col > size) return;
    for (int row = start_row; row < end_row; row++) {
        for (int col = start_col; col < end_col; col++) {
            // atomicAdd is not strictly required here, since each thread writes
            // only its own output element; a plain accumulation would also work
            atomicAdd(&output[idx_y * size + idx_x],
                      input[row * size + col] * kernal[(row - start_row) * kernal_size + (col - start_col)]);
        }
    }
}

int main() {
    int size = 64, kernal_size = 5;
    int bytes = size * size * sizeof(float);
    int kernal_bytes = kernal_size * kernal_size * sizeof(float);
    float *h_input = (float*)malloc(bytes);
    float *h_output = (float*)malloc(bytes);
    float *h_kernal = (float*)malloc(kernal_bytes);
    for (int i = 0; i < size * size; i++) h_input[i] = 1.0f;
    for (int i = 0; i < kernal_size * kernal_size; i++) h_kernal[i] = 1.0f;

    float *d_input = nullptr;
    float *d_output = nullptr;
    float *d_kernal = nullptr;
    cudaMalloc((void**)&d_input, bytes);
    cudaMalloc((void**)&d_output, bytes);
    cudaMalloc((void**)&d_kernal, kernal_bytes);
    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernal, h_kernal, kernal_bytes, cudaMemcpyHostToDevice);

    int blocksize = 16;
    dim3 block(blocksize, blocksize, 1);
    dim3 grid((size + block.x - 1) / block.x, (size + block.y - 1) / block.y, 1);
    conv2d_gpu<<<grid, block>>>(d_input, d_output, size, d_kernal, kernal_size);
    cudaDeviceSynchronize();

    cudaMemcpy(h_output, d_output, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            printf("%.1f, ", h_output[i * size + j]);
        }
        printf("\n");
    }
    return 0;
}
```