
**Note:** the original article, with correctly rendered LaTeX, is available at https://ecdicus.com/convolutional-neural-networks/

Convolutional neural networks were originally designed to process image information. When a fully connected feedforward network is used to process images, two problems arise:

- Too many parameters: if the input image is 100 × 100 × 3 (height 100, width 100, and 3 RGB color channels), then in a fully connected feedforward network each neuron in the first hidden layer has 100 × 100 × 3 = 30,000 independent connections to the input layer, each with its own weight parameter. As the number of hidden neurons increases, the parameter count grows sharply, making training very inefficient and prone to overfitting.
- Local invariance: objects in natural images have locally invariant features; operations such as scaling, translation, and rotation do not change their semantic information. Fully connected feedforward networks have difficulty extracting these local invariances, and data augmentation is generally needed to improve performance.

Convolutional neural networks are inspired by the biological receptive field mechanism. The Receptive Field mechanism mainly refers to the characteristics of some neurons in the nervous system such as hearing and vision, that is, neurons only receive signals from the stimulation area they innervate. In the visual nervous system, the output of nerve cells in the visual cortex depends on the photoreceptors on the retina. When the photoreceptors on the retina are stimulated and excited, they transmit nerve impulse signals to the visual cortex, but not all neurons in the visual cortex receive these signals. The receptive field of a neuron refers to a specific area on the retina, and only stimulation in this area can activate the neuron.

Current convolutional neural networks are generally feedforward networks composed of convolutional layers, pooling layers, and fully connected layers. Convolutional neural networks have three structural characteristics: local connection, weight sharing, and pooling. These give the network a degree of invariance to translation, scaling, and rotation. Compared with fully connected feedforward networks, convolutional neural networks have fewer parameters.

Convolutional neural networks are mainly used in various tasks of image and video analysis (such as image classification, face recognition, object recognition, image segmentation, etc.), and their accuracy is generally far beyond that of other neural network models. In recent years, convolutional neural networks have also been widely used in natural language processing, recommendation systems and other fields.

Convolution is an important operation in mathematical analysis. In signal processing and image processing, one-dimensional or two-dimensional convolution is frequently used.

## One-dimensional convolution

One-dimensional convolution is often used in signal processing to compute the delayed accumulation of a signal. Suppose a signal generator produces a signal $x_t$ at each time t, and the information decays at rate $w_k$; that is, after k − 1 time steps the information is $w_k$ times its original value. Assuming $w_1 = 1$, $w_2 = 1/2$, $w_3 = 1/4$, the signal $y_t$ received at time t is the superposition of the information generated at the current time and the delayed information from earlier times:

$y_t = 1 \times x_t + \frac{1}{2} \times x_{t-1} + \frac{1}{4} \times x_{t-2}$

$= w_1 \times x_t + w_2 \times x_{t-1} + w_3 \times x_{t-2}$

$= \sum_{k=1}^{3} w_k x_{t-k+1}$

We call $w_1, w_2, \ldots$ a filter (Filter) or convolution kernel (Convolution Kernel). Assuming the filter length is K, its convolution with a signal sequence $x_1, x_2, \ldots$ is

$y_t = \sum_{k=1}^{K} w_k x_{t-k+1}$

For the sake of simplicity, it is assumed that the subscript t of the convolution output $y_t$ starts from K.

The convolution of the signal sequence x and the filter w is defined as

$y = w \ast x$

Among them, $\ast$ denotes the convolution operation. Generally, the filter length K is much smaller than the length of the signal sequence x.

We can design different filters to extract different features of a signal sequence. For example, when $w = [1/K, \ldots, 1/K]$, the convolution is equivalent to a simple moving average of the signal sequence (with window size K); when $w = [1, -2, 1]$, the convolution approximates the second-order derivative of the signal sequence, namely

$x''(t) = x(t+1) + x(t-1) - 2x(t)$

Figure 1 shows an example of one-dimensional convolution of two filters. It can be seen that the two filters extract different features of the input sequence respectively. The filter w=[1/3,1/3,1/3] can detect the low-frequency information in the signal sequence, and the filter w=[1,-2,1] can detect the high-frequency information in the signal sequence.

Figure 1 — Example of 1D Convolution
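The two filters discussed above can be reproduced with NumPy, whose `np.convolve` in `'valid'` mode computes exactly $y_t = \sum_k w_k x_{t-k+1}$ with the output starting at t = K. A minimal sketch (the test signal `x` is an arbitrary ramp with one peak, chosen for illustration):

```python
import numpy as np

def conv1d(x, w):
    """1-D convolution y_t = sum_k w_k * x_{t-k+1}, output starting at t = K.
    This is exactly numpy's 'valid'-mode convolution."""
    return np.convolve(x, w, mode="valid")

x = np.array([1., 2., 3., 4., 5., 4., 3., 2., 1.])   # a ramp with one peak

# Low-pass filter: simple moving average with window K = 3.
w_low = np.array([1/3, 1/3, 1/3])
print(conv1d(x, w_low))     # smoothed trend of x

# High-pass filter: approximates the second-order derivative
# x''(t) = x(t+1) + x(t-1) - 2x(t).
w_high = np.array([1., -2., 1.])
print(conv1d(x, w_high))    # zero on linear segments, -2 at the peak
```

On the ramp, the moving average returns the smooth low-frequency trend, while [1, −2, 1] is zero on the linear segments and responds only at the peak, the high-frequency part.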

## Two-dimensional convolution

Convolution is also often used in image processing. Since an image has a two-dimensional structure, one-dimensional convolution must be extended. Given an image $X \in \mathbb{R}^{M \times N}$ and a filter $W \in \mathbb{R}^{U \times V}$, where generally $U \ll M$ and $V \ll N$, their convolution is

$y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv} x_{i-u+1, j-v+1}$

For the sake of simplicity, it is assumed that the subscript (i,j) of the output of the convolution $y_{ij}$ starts from (U,V).

The two-dimensional convolution of input information X and filter W is defined as

$Y = W \ast X$

Among them, $ast$ represents the two-dimensional convolution operation. Figure 2 shows an example of two-dimensional convolution.

Figure 2 — Two-dimensional convolution example
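The definition above can be implemented directly in a few lines of NumPy. This is a sketch for clarity, not an efficient implementation; the example matrices are arbitrary:

```python
import numpy as np

def conv2d(X, W):
    """True 2-D convolution: y_ij = sum_{u,v} w_uv * x_{i-u+1, j-v+1},
    with output indices (i, j) starting from (U, V)."""
    M, N = X.shape
    U, V = W.shape
    Wf = W[::-1, ::-1]                       # flip the kernel (true convolution)
    Y = np.zeros((M - U + 1, N - V + 1))
    for i in range(M - U + 1):
        for j in range(N - V + 1):
            Y[i, j] = np.sum(Wf * X[i:i+U, j:j+V])
    return Y

X = np.array([[1., 2.], [3., 4.]])
W = np.array([[1., 0.], [0., -1.]])
# y_22 = w_11*x_22 + w_12*x_21 + w_21*x_12 + w_22*x_11 = 1*4 + 0 + 0 - 1*1 = 3
print(conv2d(X, W))    # [[3.]]
```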

The mean filter (Mean Filter) commonly used in image processing is a two-dimensional convolution that sets the pixel value at the current position to the average of all pixels in the filter window, that is, $w_{uv} = \frac{1}{UV}$.

In image processing, convolution is often used as an effective method of feature extraction. The result of convolving an image is called a feature map (Feature Map). Figure 3 shows several filters commonly used in image processing and their corresponding feature maps. The top filter is a common Gaussian filter, which can be used to smooth and denoise an image; the middle and bottom filters can be used to extract edge features.

Figure 3 — Examples of several commonly used filters in image processing

In machine learning and image processing, the main use of convolution is to slide a convolution kernel (i.e., a filter) over an image (or some feature map) and obtain a new set of features through the convolution operation. Computing a convolution requires flipping the convolution kernel. In practice, the cross-correlation operation is generally used in place of convolution, which avoids some unnecessary operations and overhead. Cross-correlation (Cross-Correlation) is a function that measures the correlation of two sequences, usually realized as a dot product computed over a sliding window. Given an image $X \in \mathbb{R}^{M \times N}$ and a convolution kernel $W \in \mathbb{R}^{U \times V}$, their cross-correlation is

$y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv} x_{i+u-1, j+v-1}$

Comparing with formula (7), it can be seen that the difference between cross-correlation and convolution is only whether the convolution kernel is flipped. Therefore, cross-correlation can also be called non-flip convolution.

Formula (9) can be expressed as

$Y = W \otimes X$

$= \mathrm{rot180}(W) \ast X$

Among them, $\otimes$ represents the cross-correlation operation, $\mathrm{rot180}(\cdot)$ represents a rotation of 180 degrees, and $Y \in \mathbb{R}^{(M-U+1) \times (N-V+1)}$ is the output matrix.

Convolution is used in neural networks for feature extraction, and whether the convolution kernel is flipped is irrelevant to its feature-extraction ability. In particular, when the kernel is a learnable parameter, convolution and cross-correlation are equivalent in capability. Therefore, for convenience of implementation (and description), we use cross-correlation in place of convolution. In fact, the "convolution" operations in many deep learning tools are actually cross-correlation operations.
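The flip relation between the two operations can be checked numerically: cross-correlating with W equals truly convolving with rot180(W). A small NumPy sketch (the random test arrays are illustrative):

```python
import numpy as np

def correlate2d(X, W):
    """Cross-correlation (no flip): y_ij = sum_{u,v} w_uv * x_{i+u-1, j+v-1}."""
    M, N = X.shape
    U, V = W.shape
    return np.array([[np.sum(W * X[i:i+U, j:j+V])
                      for j in range(N - V + 1)] for i in range(M - U + 1)])

def convolve2d(X, W):
    """True convolution = cross-correlation with the kernel rotated 180 degrees."""
    return correlate2d(X, np.rot90(W, 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
W = rng.standard_normal((3, 3))

# W cross-correlated with X equals rot180(W) truly convolved with X.
print(np.allclose(correlate2d(X, W), convolve2d(X, np.rot90(W, 2))))   # True
```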

On the basis of the standard definition of convolution, the sliding step size and zero padding of the convolution kernel can also be introduced to increase the diversity of convolution, and feature extraction can be performed more flexibly.

Stride is the step length by which the convolution kernel slides. Figure 4a shows an example of convolution with stride 2.

Zero padding (Zero Padding) pads zeros at both ends of the input vector. Figure 4b shows an example of convolution with one zero padded at each end of the input.

Figure 4 — Stride and zero padding for convolution (filters are [−1, 0, 1]).

Assuming that the number of input neurons of the convolutional layer is M, the kernel size is K, the stride is S, and P zeros are padded at each end of the input (zero padding), then the number of neurons in the convolutional layer is (M − K + 2P)/S + 1.

The commonly used convolutions fall into the following three categories:

- Narrow convolution: stride S = 1, no zero padding (P = 0); the output length after convolution is M − K + 1.
- Wide convolution: stride S = 1, zero padding P = K − 1; the output length after convolution is M + K − 1.
- Equal-width convolution: stride S = 1, zero padding P = (K − 1)/2; the output length after convolution is M. Figure 4b is an example of equal-width convolution.
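These three output lengths correspond directly to the `'valid'`, `'full'`, and `'same'` modes of `np.convolve`; a quick check, assuming stride S = 1 throughout:

```python
import numpy as np

x = np.arange(8, dtype=float)      # M = 8
w = np.array([1., -2., 1.])        # K = 3

print(len(np.convolve(x, w, mode="valid")))   # narrow:      M - K + 1 = 6
print(len(np.convolve(x, w, mode="full")))    # wide:        M + K - 1 = 10
print(len(np.convolve(x, w, mode="same")))    # equal-width: M = 8
```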

Convolution has many good mathematical properties. In this section, we introduce some mathematical properties of two-dimensional convolution. These mathematical properties can also be applied to the case of one-dimensional convolution.

## Commutativity

If the lengths of the two signals being convolved are not restricted, true (flipped) convolution is commutative, that is, $x \ast y = y \ast x$. The cross-correlation "convolution" also has a certain degree of "commutativity".

We first introduce the definition of wide convolution (Wide Convolution). Given a two-dimensional image $X \in \mathbb{R}^{M \times N}$ and a two-dimensional convolution kernel $W \in \mathbb{R}^{U \times V}$, zero-pad the image X with U − 1 and V − 1 zeros at each end to obtain a fully padded (Full Padding) image $\widetilde{X} \in \mathbb{R}^{(M+2U-2) \times (N+2V-2)}$. The wide convolution of image X and convolution kernel W is defined as

$W \widetilde{\otimes} X \overset{\Delta}{=} W \otimes \widetilde{X}$

Among them, $\widetilde{\otimes}$ represents the wide convolution operation.

When the input information and the convolution kernel have a fixed length, their wide convolution is still commutative, that is

$\mathrm{rot180}(W) \widetilde{\otimes} X = \mathrm{rot180}(X) \widetilde{\otimes} W$

Among them, rot180(.) represents a rotation of 180 degrees.

Suppose $Y = W \otimes X$, where $X \in \mathbb{R}^{M \times N}$, $W \in \mathbb{R}^{U \times V}$, $Y \in \mathbb{R}^{(M-U+1) \times (N-V+1)}$, and $f(Y) \in \mathbb{R}$ is a scalar function; then

$\frac{\partial f(Y)}{\partial w_{uv}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} \frac{\partial y_{ij}}{\partial w_{uv}} \frac{\partial f(Y)}{\partial y_{ij}}$

$= \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} x_{i+u-1, j+v-1} \frac{\partial f(Y)}{\partial y_{ij}}$

It can be seen from formula (16) that the partial derivative of f(Y) with respect to W is the convolution of X and $\frac{\partial f(Y)}{\partial Y}$:

$\frac{\partial f(Y)}{\partial W} = \frac{\partial f(Y)}{\partial Y} \otimes X$

Similarly,

$\frac{\partial f(Y)}{\partial x_{st}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} \frac{\partial y_{ij}}{\partial x_{st}} \frac{\partial f(Y)}{\partial y_{ij}}$

$= \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} w_{s-i+1, t-j+1} \frac{\partial f(Y)}{\partial y_{ij}}$

Among them, $w_{s-i+1, t-j+1} = 0$ when (s − i + 1) < 1, (s − i + 1) > U, (t − j + 1) < 1, or (t − j + 1) > V. That is, this is equivalent to zero-padding W with P = (M − U, N − V).

It can be seen from formula (19) that the partial derivative of f(Y) with respect to X is the wide convolution of W and $\frac{\partial f(Y)}{\partial Y}$. The convolution in formula (19) is true convolution rather than cross-correlation. For consistency, we express it using the cross-correlation "convolution", namely

$\frac{\partial f(Y)}{\partial X} = \mathrm{rot180}\left(\frac{\partial f(Y)}{\partial Y}\right) \widetilde{\otimes} W$

$= \mathrm{rot180}(W) \widetilde{\otimes} \frac{\partial f(Y)}{\partial Y}$

Among them, rot180(.) represents a rotation of 180 degrees.
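This backward formula can be verified numerically. The sketch below is illustrative: it takes $f(Y) = \frac{1}{2}\|Y\|^2$ (so that $\frac{\partial f}{\partial Y} = Y$), computes $\frac{\partial f}{\partial X}$ as the wide cross-correlation of rot180(W) with $\frac{\partial f}{\partial Y}$, and compares every entry against finite differences:

```python
import numpy as np

def correlate2d(X, W):
    # valid-mode cross-correlation
    M, N = X.shape
    U, V = W.shape
    return np.array([[np.sum(W * X[i:i+U, j:j+V])
                      for j in range(N - V + 1)] for i in range(M - U + 1)])

def wide_correlate2d(X, W):
    # pad X with U-1 / V-1 zeros at each end, then valid-mode correlation
    U, V = W.shape
    Xp = np.pad(X, ((U - 1, U - 1), (V - 1, V - 1)))
    return correlate2d(Xp, W)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
W = rng.standard_normal((3, 3))

Y = correlate2d(X, W)          # forward pass: Y = W cross-correlated with X
delta = Y.copy()               # df/dY for f(Y) = 0.5 * sum(Y**2)

# Backward per the formula: df/dX = rot180(W) wide-correlated with df/dY.
grad_X = wide_correlate2d(delta, np.rot90(W, 2))

# Finite-difference check of every entry of df/dX.
eps = 1e-6
num = np.zeros_like(X)
for s in range(5):
    for t in range(5):
        Xp, Xm = X.copy(), X.copy()
        Xp[s, t] += eps
        Xm[s, t] -= eps
        num[s, t] = (0.5 * np.sum(correlate2d(Xp, W)**2)
                     - 0.5 * np.sum(correlate2d(Xm, W)**2)) / (2 * eps)

print(np.allclose(grad_X, num, atol=1e-4))   # True
```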

Convolutional neural networks are generally composed of convolutional layers, pooling layers, and fully connected layers.

In a fully connected feedforward neural network, if layer l has $M_l$ neurons and layer l − 1 has $M_{l-1}$ neurons, there are $M_l \times M_{l-1}$ connections; that is, the weight matrix has $M_l \times M_{l-1}$ parameters. When $M_l$ and $M_{l-1}$ are both very large, the weight matrix has a huge number of parameters and training is very inefficient.

If convolution replaces the full connection, the net input $z^{(l)}$ of layer l is the convolution of the activations $x^{(l-1)}$ of layer l − 1 with the convolution kernel $w^{(l)} \in \mathbb{R}^K$, namely

$z^{(l)} = w^{(l)} \otimes x^{(l-1)} + b^{(l)}$

Among them, the convolution kernel $w^{(l)} \in \mathbb{R}^K$ is a learnable weight vector and $b^{(l)} \in \mathbb{R}$ is a learnable bias.

Figure 5 — Comparison of fully connected layers and convolutional layers.

Local connection: each neuron in the convolutional layer (say layer l) is connected only to neurons within a local window of the previous layer (layer l − 1), forming a locally connected network. As shown in Figure 5b, the number of connections between the convolutional layer and the previous layer is greatly reduced, from $M_l \times M_{l-1}$ to $M_l \times K$, where K is the size of the convolution kernel. Weight sharing can be understood as one convolution kernel capturing only one specific local feature of the input data; therefore, to extract multiple features, multiple different convolution kernels are needed.

Due to local connection and weight sharing, the parameters of the convolutional layer are just the K-dimensional weight vector $w^{(l)}$ and a scalar bias $b^{(l)}$, a total of K + 1 parameters. The number of parameters is independent of the number of neurons. In addition, the number of neurons in layer l is not chosen arbitrarily but satisfies $M_l = M_{l-1} - K + 1$.

The function of the convolutional layer is to extract the features of a local region; different convolution kernels act as different feature extractors. The convolutional-layer neurons described above have a one-dimensional structure, like those of a fully connected network. Since convolutional networks are mainly applied to image processing, and images have a two-dimensional structure, neurons are usually organized into neural layers with a three-dimensional structure of height M × width N × depth D, consisting of D feature maps of size M × N, in order to make full use of the local information of the image.

A feature map (Feature Map) is the result of convolving an image (or another feature map); each feature map can be regarded as one class of extracted image features. To improve the representational capacity of a convolutional network, multiple different feature maps can be used in each layer to better represent the image's features.

In the input layer, the feature maps are the image itself. A gray-scale image has one feature map, so the depth of the input layer is D = 1; a color image has feature maps for the three RGB color channels, so the depth of the input layer is D = 3.

Without loss of generality, suppose the structure of a convolutional layer is as follows:

- Input feature map group: $\mathcal{X} \in \mathbb{R}^{M \times N \times D}$ is a three-dimensional tensor (Tensor), where each slice (Slice) matrix $X^d \in \mathbb{R}^{M \times N}$ is an input feature map, $1 \leq d \leq D$;
- Output feature map group: $\mathcal{Y} \in \mathbb{R}^{M' \times N' \times P}$ is a three-dimensional tensor, where each slice matrix $Y^p \in \mathbb{R}^{M' \times N'}$ is an output feature map, $1 \leq p \leq P$;
- Convolution kernels: $\mathcal{W} \in \mathbb{R}^{U \times V \times D \times P}$ is a four-dimensional tensor, where each slice matrix $W^{p,d} \in \mathbb{R}^{U \times V}$ is a two-dimensional convolution kernel, $1 \leq p \leq P$, $1 \leq d \leq D$.

Figure 6 shows the three-dimensional structure of the convolutional layer.

Figure 6 — Three-dimensional structure of a convolutional layer

To compute the output feature map $Y^p$, convolve the input feature maps $X^1, X^2, \ldots, X^D$ with the convolution kernels $W^{p,1}, W^{p,2}, \ldots, W^{p,D}$, sum the results, and add a scalar bias $b^p$ to obtain the net input $Z^p$ of the convolutional layer; applying a nonlinear activation function then yields the output feature map $Y^p$.

$Z^p = W^p \otimes X + b^p = \sum_{d=1}^{D} W^{p,d} \otimes X^d + b^p$

$Y^p = f(Z^p)$

Among them, $W^p \in \mathbb{R}^{U \times V \times D}$ is a three-dimensional convolution kernel and $f(\cdot)$ is a nonlinear activation function, generally ReLU.

The whole computation process is shown in Figure 7. If the convolutional layer is to output P feature maps, the above computation is repeated P times to obtain the P output feature maps $Y^1, Y^2, \ldots, Y^P$.

Figure 7 — Example of computation from input feature map group X to output feature map $Y^p$ in a convolutional layer

For a convolutional layer with input $\mathcal{X} \in \mathbb{R}^{M \times N \times D}$ and output $\mathcal{Y} \in \mathbb{R}^{M' \times N' \times P}$, each output feature map requires D convolution kernels and one bias. Assuming each kernel is of size $U \times V$, a total of $P \times D \times (U \times V) + P$ parameters are required.
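The forward computation described above can be sketched directly (and inefficiently) in NumPy. The shapes here are illustrative — an 8 × 8 × 3 input, P = 4 kernels of size 3 × 3, and ReLU as the activation f:

```python
import numpy as np

def correlate2d(X, W):
    M, N = X.shape
    U, V = W.shape
    return np.array([[np.sum(W * X[i:i+U, j:j+V])
                      for j in range(N - V + 1)] for i in range(M - U + 1)])

def conv_layer(X, W, b):
    """Z^p = sum_d (W^{p,d} cross-correlated with X^d) + b^p;  Y^p = ReLU(Z^p).
    X: (M, N, D) input maps; W: (U, V, D, P) kernels; b: (P,) biases."""
    D = X.shape[2]
    P = W.shape[3]
    Z = np.stack([sum(correlate2d(X[:, :, d], W[:, :, d, p]) for d in range(D))
                  + b[p] for p in range(P)], axis=-1)
    return np.maximum(Z, 0)            # ReLU

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 8, 3))                 # an 8x8 "RGB" input
W = rng.standard_normal((3, 3, 3, 4))              # U = V = 3, D = 3, P = 4
b = np.zeros(4)

Y = conv_layer(X, W, b)
print(Y.shape)              # (6, 6, 4): (M-U+1, N-V+1, P)
print(W.size + b.size)      # P*D*(U*V) + P = 4*3*9 + 4 = 112 parameters
```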

The pooling layer (Pooling Layer), also called the subsampling layer (Subsampling Layer), performs feature selection and reduces the number of features, thereby reducing the number of parameters.

Although the convolutional layer can significantly reduce the number of connections in the network, the number of neurons in the feature map group is not significantly reduced. If a classifier follows directly, its input dimensionality is still very high and overfitting is likely. To solve this problem, a pooling layer can be added after the convolutional layer to reduce the feature dimensionality and avoid overfitting.

Assume the input feature map group of the pooling layer is $\mathcal{X} \in \mathbb{R}^{M \times N \times D}$. For each feature map $X^d \in \mathbb{R}^{M \times N}$, $1 \leq d \leq D$, divide it into many regions $R^d_{m,n}$, $1 \leq m \leq M'$, $1 \leq n \leq N'$; these regions may or may not overlap. Pooling refers to down-sampling (Down Sampling) each region to obtain a single value as a summary of that region.

There are two commonly used pooling functions:

- Max pooling (Maximum Pooling or Max Pooling): for a region $R^d_{m,n}$, select the maximum activity value of all neurons in the region as its representation, namely

$y_{m,n}^{d} = \max_{i \in R_{m,n}^{d}} x_i$

Among them, $x_i$ is the activity value of each neuron in the region $R_{m,n}^{d}$.

- Mean pooling (Mean Pooling): take the average of the activity values of all neurons in the region, namely

$y_{m,n}^{d} = \frac{1}{\left| R_{m,n}^{d} \right|} \sum_{i \in R_{m,n}^{d}} x_i$

Down-sampling the $M' \times N'$ regions of each input feature map $X^d$ yields the output feature map $Y^d = \{ y_{m,n}^{d} \}$, $1 \leq m \leq M'$, $1 \leq n \leq N'$.
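For non-overlapping square regions, both pooling functions reduce to a reshape in NumPy. A minimal sketch with an illustrative 4 × 4 feature map and 2 × 2 regions:

```python
import numpy as np

def pool2d(X, size=2, mode="max"):
    """Non-overlapping pooling of one feature map (H and W divisible by size)."""
    H, W = X.shape
    blocks = X.reshape(H // size, size, W // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))    # mean pooling

X = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 0., 1., 2.],
              [1., 1., 3., 3.]])
print(pool2d(X, mode="max"))     # [[4. 8.] [9. 3.]]
print(pool2d(X, mode="mean"))    # [[2.5  6.5 ] [2.75 2.25]]
```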

Figure 8 shows an example of the sub-sampling operation using max pooling. It can be seen that the pooling layer not only effectively reduces the number of neurons, but also keeps the network invariant to small local morphological changes and gives it a larger receptive field.

Figure 8 — Example of the max pooling process in a pooling layer

In current mainstream convolutional networks, the pooling layer contains only the down-sampling operation. But in some early convolutional networks (such as LeNet-5), a nonlinear activation function was sometimes used in the pooling layer, such as

$Y'^{d} = f(w^d Y^d + b^d)$

Among them, $Y'^{d}$ is the output of the pooling layer, $f(\cdot)$ is a nonlinear activation function, and $w^d$ and $b^d$ are a learnable scalar weight and bias.

A typical pooling layer divides each feature map into non-overlapping 2 × 2 regions and then down-samples with max pooling. The pooling layer can also be regarded as a special convolutional layer with kernel size K × K and stride S × S, whose kernel is the max function or the mean function. An overly large pooling region drastically reduces the number of neurons and causes excessive information loss.

A typical convolutional network is composed of convolutional layers, pooling layers, and fully connected layers; the overall structure of a commonly used convolutional network is shown in Figure 9. A convolutional block consists of M consecutive convolutional layers and b pooling layers (M is usually 2 to 5, b is 0 or 1). A convolutional network can stack N consecutive convolutional blocks, followed by K fully connected layers (N can range widely, e.g., 1 to 100 or more; K is generally 0 to 2).

Figure 9 — Overall structure of a typical convolutional network

At present, convolutional network architectures tend toward smaller convolution kernels (such as 1 × 1 and 3 × 3) and deeper structures (e.g., more than 50 layers). In addition, as convolution operations have become more flexible (e.g., with different strides), the role of the pooling layer has diminished, so the proportion of pooling layers in currently popular convolutional networks is gradually decreasing, trending toward fully convolutional networks.

In the convolutional network, the parameters are the weight and bias in the convolution kernel. Similar to the fully connected feedforward network, the convolutional network can also perform parameter learning through the error backpropagation algorithm.

In a fully connected feedforward neural network, the gradient is back-propagated mainly through the error term $\delta$ of each layer, from which the gradients of each layer's parameters are computed. A convolutional neural network mainly has two kinds of functionally distinct layers: convolutional layers and pooling layers. The parameters are the convolution kernels and biases, so only the gradients of the convolutional-layer parameters need to be computed.

Without loss of generality, let layer l be a convolutional layer with input feature maps $\mathcal{X}^{(l-1)} \in \mathbb{R}^{M \times N \times D}$, from which convolution produces the net input of the layer-l feature maps, $\mathcal{Z}^{(l)} \in \mathbb{R}^{M' \times N' \times P}$. The net input of the p-th ($1 \leq p \leq P$) feature map of layer l is

$Z^{(l,p)} = \sum_{d=1}^{D} W^{(l,p,d)} \otimes X^{(l-1,d)} + b^{(l,p)}$

Among them, $W^{(l,p,d)}$ and $b^{(l,p)}$ are the convolution kernels and biases. Layer l has $P \times D$ convolution kernels and P biases in total, whose gradients can each be computed using the chain rule.

According to formulas (17) and (28), the partial derivative of the loss function $\mathcal{L}$ with respect to the convolution kernel $W^{(l,p,d)}$ of layer l is

$\frac{\partial \mathcal{L}}{\partial W^{(l,p,d)}} = \frac{\partial \mathcal{L}}{\partial Z^{(l,p)}} \otimes X^{(l-1,d)}$

$= \delta^{(l,p)} \otimes X^{(l-1,d)}$

Where $\delta^{(l,p)} = \frac{\partial \mathcal{L}}{\partial Z^{(l,p)}}$ is the partial derivative of the loss function with respect to the net input $Z^{(l,p)}$ of the p-th feature map of layer l.

Similarly, the partial derivative of the loss function with respect to the p-th bias $b^{(l,p)}$ of layer l is

$\frac{\partial \mathcal{L}}{\partial b^{(l,p)}} = \sum_{i,j} \left[ \delta^{(l,p)} \right]_{i,j}$

In a convolutional network, the gradient of each layer's parameters depends on that layer's error term $\delta^{(l,p)}$.
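The two parameter gradients above can be checked numerically. A sketch under illustrative assumptions — a single input map and kernel, and $\mathcal{L} = \frac{1}{2}\sum Z^2$ so that $\delta = Z$:

```python
import numpy as np

def correlate2d(X, W):
    M, N = X.shape
    U, V = W.shape
    return np.array([[np.sum(W * X[i:i+U, j:j+V])
                      for j in range(N - V + 1)] for i in range(M - U + 1)])

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 5))    # one input feature map
W = rng.standard_normal((3, 3))    # one kernel
b = 0.3

Z = correlate2d(X, W) + b
delta = Z.copy()                   # dL/dZ for L = 0.5 * sum(Z**2)

grad_W = correlate2d(X, delta)     # dL/dW = delta cross-correlated with X
grad_b = delta.sum()               # dL/db = sum of the error term

# Finite-difference check of one kernel entry.
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
num = (0.5 * np.sum((correlate2d(X, Wp) + b)**2)
       - 0.5 * np.sum((correlate2d(X, Wm) + b)**2)) / (2 * eps)
print(abs(grad_W[1, 2] - num) < 1e-5)   # True
```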

The error term is computed differently in convolutional and pooling layers, so we treat the two cases separately. Pooling layer: when layer l + 1 is a pooling layer, since pooling is a down-sampling operation, the error term of each neuron in layer l + 1 corresponds to a region of the corresponding feature map in layer l. Each neuron in the p-th feature map of layer l is connected to exactly one neuron in the p-th feature map of layer l + 1. By the chain rule, the error term $\delta^{(l,p)}$ of a feature map in layer l is obtained by up-sampling the error term $\delta^{(l+1,p)}$ of the corresponding feature map in layer l + 1 (to the same size as layer l) and then multiplying it element-wise by the derivative of the activation of the layer-l feature map.

The error term $\delta^{(l,p)}$ of the p-th feature map of layer l is derived as follows:

$\delta^{(l,p)} \overset{\Delta}{=} \frac{\partial \mathcal{L}}{\partial Z^{(l,p)}}$

$= \frac{\partial X^{(l,p)}}{\partial Z^{(l,p)}} \cdot \frac{\partial Z^{(l+1,p)}}{\partial X^{(l,p)}} \cdot \frac{\partial \mathcal{L}}{\partial Z^{(l+1,p)}}$

$= f_l'(Z^{(l,p)}) \odot \mathrm{up}(\delta^{(l+1,p)})$

Among them, $f_l'(\cdot)$ is the derivative of the activation function used in layer l, and $\mathrm{up}$ is the up-sampling function, the inverse of the down-sampling operation used in the pooling layer. If the down-sampling is max pooling, each value in the error term $\delta^{(l+1,p)}$ is passed directly to the neuron at the position of the maximum in the corresponding region of the previous layer, and the error terms of the other neurons in that region are set to zero. If the down-sampling is mean pooling, each value in $\delta^{(l+1,p)}$ is distributed equally among all neurons in the corresponding region of the previous layer.
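The max pooling case of the up function can be sketched as follows: each upstream error value is routed to the arg-max position of its pooling region, and all other positions receive zero (the input X and error delta_out here are illustrative):

```python
import numpy as np

def maxpool_backward(X, delta_out, size=2):
    """up(.) for max pooling: route each upstream error to the arg-max
    position of its pooling region; all other positions get zero."""
    grad = np.zeros_like(X)
    H, W = X.shape
    for i in range(0, H, size):
        for j in range(0, W, size):
            block = X[i:i+size, j:j+size]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            grad[i + r, j + c] = delta_out[i // size, j // size]
    return grad

X = np.array([[1., 3.],
              [2., 0.]])
delta_out = np.array([[5.]])
print(maxpool_backward(X, delta_out))   # [[0. 5.] [0. 0.]]: 5 goes to the max (3)
```

For mean pooling, the corresponding backward step would instead spread each upstream value evenly, assigning delta_out divided by size² to every position in the region.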

Convolutional layer: when layer l + 1 is a convolutional layer, assume its feature map net input is $\mathcal{Z}^{(l+1)} \in \mathbb{R}^{M' \times N' \times P}$, where the net input of the p-th ($1 \leq p \leq P$) feature map is

$Z^{(l+1,p)} = \sum_{d=1}^{D} W^{(l+1,p,d)} \otimes X^{(l,d)} + b^{(l+1,p)}$

Among them, $W^{(l+1,p,d)}$ and $b^{(l+1,p)}$ are the convolution kernels and biases of layer l + 1, which has $P \times D$ convolution kernels and P biases in total.

The error term $\delta^{(l,d)}$ of the d-th feature map of layer l is derived as follows:

$\delta^{(l,d)} \overset{\Delta}{=} \frac{\partial \mathcal{L}}{\partial Z^{(l,d)}}$

$= \frac{\partial X^{(l,d)}}{\partial Z^{(l,d)}} \cdot \frac{\partial \mathcal{L}}{\partial X^{(l,d)}}$

$= f_l'(Z^{(l,d)}) \odot \sum_{p=1}^{P} \left( \mathrm{rot180}(W^{(l+1,p,d)}) \widetilde{\otimes} \frac{\partial \mathcal{L}}{\partial Z^{(l+1,p)}} \right)$

$= f_l'(Z^{(l,d)}) \odot \sum_{p=1}^{P} \left( \mathrm{rot180}(W^{(l+1,p,d)}) \widetilde{\otimes} \delta^{(l+1,p)} \right)$

Where $\widetilde{\otimes}$ is the wide convolution.

This section introduces several widely used, typical deep convolutional neural networks.

Although LeNet-5 [LeCun et al., 1998] was proposed earlier, it is a very successful neural network model. The handwritten digit recognition system based on LeNet-5 was used by many banks in the United States in the 1990s to recognize handwritten digits on checks. The network structure of LeNet-5 is shown in Figure 10.

Figure 10 — LeNet-5 network structure (picture drawn according to [LeCun et al., 1998])

LeNet-5 has 7 layers, accepts input image size of 32 x 32 = 1 024, and outputs scores corresponding to 10 categories. The structure of each layer in LeNet-5 is as follows:

- The C1 layer is a convolutional layer that uses six 5 × 5 convolution kernels to obtain six feature maps of size 28 × 28 = 784. The number of neurons in C1 is therefore 6 × 784 = 4,704, the number of trainable parameters is 6 × 25 + 6 = 156, and the number of connections is 156 × 784 = 122,304 (including biases, likewise below).
- The S2 layer is a pooling layer with a 2 × 2 sampling window, using mean pooling and a nonlinear function as in equation (27). The number of neurons is 6 × 14 × 14 = 1,176, the number of trainable parameters is 6 × (1 + 1) = 12, and the number of connections is 6 × 196 × (4 + 1) = 5,880.
- The C3 layer is a convolutional layer. LeNet-5 uses a connection table to define the dependency between the input and output feature maps (see the connection table formula below). As shown in Figure 11, a total of 60 convolution kernels of size 5 × 5 are used to obtain 16 feature maps of size 10 × 10. The number of neurons is 16 × 100 = 1,600, the number of trainable parameters is (60 × 25) + 16 = 1,516, and the number of connections is 100 × 1,516 = 151,600.
- The S4 layer is a pooling layer with a 2 × 2 sampling window, producing 16 feature maps of size 5 × 5. The number of trainable parameters is 16 × 2 = 32, and the number of connections is 16 × 25 × (4 + 1) = 2,000.
- The C5 layer is a convolutional layer that uses 120 × 16 = 1,920 convolution kernels of size 5 × 5 to obtain 120 feature maps of size 1 × 1. The number of neurons in C5 is 120, the number of trainable parameters is 1,920 × 25 + 120 = 48,120, and the number of connections is 120 × (16 × 25 + 1) = 48,120.
- The F6 layer is a fully connected layer with 84 neurons; the number of trainable parameters is 84 × (120 + 1) = 10,164. The number of connections equals the number of trainable parameters, 10,164.
- The output layer is composed of 10 radial basis functions (Radial Basis Function, RBF), which we do not describe further here.
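The layer statistics quoted above are easy to re-derive; for example, for the C1 layer (the other layers follow analogously):

```python
# Re-derive the C1-layer statistics quoted above.
kernels, ksize = 6, 5
out_side = 32 - ksize + 1                   # 28 (valid convolution on 32x32 input)
neurons = kernels * out_side * out_side     # 6 * 784 = 4704
params = kernels * ksize * ksize + kernels  # 6*25 + 6 = 156 (weights + biases)
connections = params * out_side * out_side  # 156 * 784 = 122304
print(neurons, params, connections)         # 4704 156 122304
```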

Connection table: it can be seen from formula (23) that each output feature map of a convolutional layer depends on all input feature maps, which is equivalent to a fully connected relationship between the layer's input and output feature maps. In fact, this full connectivity is not necessary: each output feature map can be made to depend on only a few input feature maps. Define a connection table (Link Table) T to describe the connectivity between input and output feature maps. In LeNet-5, the connection table is set as shown in Figure 11: the 0th–5th feature maps of C3 each depend on a subset of 3 consecutive feature maps of the S2 feature map group, the 6th–11th each depend on a subset of 4 consecutive feature maps, the 12th–14th each depend on a subset of 4 non-consecutive feature maps, and the 15th depends on all feature maps of S2.

Figure 11 — Connection table of layer C3 in LeNet-5 (Image source: [LeCun et al., 1998]).

If the p-th output feature map depends on the d-th input feature map, then $T_{p,d} = 1$; otherwise $T_{p,d} = 0$. Then $Y^p$ is

$Y^p = f\left( \sum_{d,\, T_{p,d}=1} W^{p,d} \otimes X^d + b^p \right)$

Among them, T is a connection table of size $P \times D$. Assuming T has K non-zero entries and each convolution kernel has size $U \times V$, a total of $K \times U \times V + P$ parameters are required.
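As a worked example of this count, C3 of LeNet-5 has P = 16, D = 6, kernels of size 5 × 5, and K = 60 non-zeros in T. The table built below only mimics the sparsity pattern of LeNet-5's table (3, 4, 4, and 6 inputs per group); it is not the exact table from the paper:

```python
import numpy as np

# Illustrative connection table for a C3-like layer: P = 16 outputs, D = 6 inputs.
T = np.zeros((16, 6), dtype=int)
for p in range(6):                                   # 3 consecutive inputs each
    T[p, [p % 6, (p + 1) % 6, (p + 2) % 6]] = 1
for p in range(6, 12):                               # 4 consecutive inputs each
    T[p, [(p - 6 + i) % 6 for i in range(4)]] = 1
for p in range(12, 15):                              # 4 non-consecutive inputs each
    q = p - 12
    T[p, [q % 6, (q + 1) % 6, (q + 3) % 6, (q + 4) % 6]] = 1
T[15, :] = 1                                         # all 6 inputs

K = T.sum()                 # number of kernels = non-zeros in T
print(K)                    # 60
print(K * 5 * 5 + 16)       # K*U*V + P = 60*25 + 16 = 1516 parameters
```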

AlexNet [Krizhevsky et al., 2012] is the first modern deep convolutional network model. It was the first to use many techniques that are now standard in deep convolutional networks, such as GPU parallel training, ReLU as the nonlinear activation function, Dropout to prevent overfitting, and data augmentation to improve model accuracy. AlexNet won the 2012 ImageNet image classification competition.

The structure of AlexNet is shown in Figure 12, including 5 convolutional layers, 3 pooling layers, and 3 fully connected layers (the last of which is the output layer using the Softmax function). Because the size of the network exceeded the memory limit of a single GPU at the time, AlexNet split the network into two halves placed on two GPUs, which communicate only at certain layers (such as the 3rd layer).

Figure 12 — AlexNet network structure.

The input of AlexNet is an image of 224 x 224 x 3, and the output is the conditional probability of 1,000 categories. The specific structure is as follows:

- The first convolutional layer uses two sets of convolution kernels of size 11 x 11 x 3 x 48, with stride S = 4 and zero padding P = 3, obtaining two feature map groups of size 55 x 55 x 48.
- The first pooling layer uses max pooling of size 3 x 3 with stride S = 2, obtaining two feature map groups of size 27 x 27 x 48.
- The second convolutional layer uses two sets of convolution kernels of size 5 x 5 x 48 x 128, with stride S = 1 and zero padding P = 2, obtaining two feature map groups of size 27 x 27 x 128.
- The second pooling layer uses max pooling of size 3 x 3 with stride S = 2, obtaining two feature map groups of size 13 x 13 x 128.
- The third convolutional layer fuses the two paths, using convolution kernels of size 3 x 3 x 256 x 384 with stride S = 1 and zero padding P = 1, obtaining two feature map groups of size 13 x 13 x 192.
- The fourth convolutional layer uses two sets of convolution kernels of size 3 x 3 x 192 x 192, with stride S = 1 and zero padding P = 1, obtaining two feature map groups of size 13 x 13 x 192.
- The fifth convolutional layer uses two sets of convolution kernels of size 3 x 3 x 192 x 128, with stride S = 1 and zero padding P = 1, obtaining two feature map groups of size 13 x 13 x 128.
- The third pooling layer uses max pooling of size 3 x 3 with stride S = 2, obtaining two feature map groups of size 6 x 6 x 128.
- Three fully connected layers with 4,096, 4,096, and 1,000 neurons, respectively.

In addition, AlexNet performs Local Response Normalization (LRN) after the first two pooling layers to enhance the generalization ability of the model.
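The feature-map sizes listed above follow from the standard output-size formula for convolution and pooling. A minimal sketch that re-derives them (the helper name `out_size` is illustrative):

```python
import math

# Output spatial size of a convolution or pooling layer:
# floor((M - K + 2P) / S) + 1, for input size M, kernel size K,
# stride S, and zero padding P.
def out_size(M, K, S, P=0):
    return math.floor((M - K + 2 * P) / S) + 1

print(out_size(224, 11, 4, 3))  # 55  (first convolutional layer)
print(out_size(55, 3, 2))       # 27  (first pooling layer)
print(out_size(27, 5, 1, 2))    # 27  (second convolutional layer)
print(out_size(13, 3, 1, 1))    # 13  (third/fourth/fifth convolutional layers)
print(out_size(13, 3, 2))       # 6   (third pooling layer)
```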

In a convolutional network, how to set the size of the convolution kernels in a convolutional layer is a critical issue. In the Inception network, a convolutional layer contains multiple convolution operations with different kernel sizes; such a layer is called an Inception module. The Inception network is a stack of multiple Inception modules and a small number of pooling layers.

An Inception module uses convolution kernels of several sizes, such as 1 x 1, 3 x 3, and 5 x 5, and concatenates (stacks) the resulting feature maps along the depth dimension to form the output feature maps.

Figure 13 shows the structure of the Inception v1 module, which uses 4 parallel groups of feature extraction: 1 x 1, 3 x 3, and 5 x 5 convolutions and 3 x 3 max pooling. To improve computational efficiency and reduce the number of parameters, the Inception module performs a 1 x 1 convolution before the 3 x 3 and 5 x 5 convolutions and after the 3 x 3 max pooling to reduce the depth of the feature maps. If there is redundant information between the input feature maps, this 1 x 1 convolution acts as a preliminary feature-extraction step.

Figure 13 — Module structure of Inception v1.
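The savings from the 1 x 1 "bottleneck" described above can be illustrated by counting multiplications. The input size and channel counts below are assumptions chosen for illustration, not the actual GoogLeNet configuration:

```python
# Multiplication counts for a 5 x 5 convolution on a 28 x 28 x 192 input
# producing 32 output maps, with and without a 1 x 1 convolution that first
# reduces the depth to 16 (all sizes are illustrative assumptions).
H = W = 28
direct = H * W * 5 * 5 * 192 * 32
bottleneck = H * W * 1 * 1 * 192 * 16 + H * W * 5 * 5 * 16 * 32
print(direct)       # 120422400
print(bottleneck)   # 12443648  -- roughly a 10x reduction
```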

There are multiple versions of the Inception network. The earliest, Inception v1, is the famous GoogLeNet [Szegedy et al., 2015], which won the 2014 ImageNet image classification competition. GoogLeNet consists of 9 Inception v1 modules, 5 pooling layers, and some other convolutional and fully connected layers, 22 layers in total, as shown in Figure 14.

Figure 14 — GoogLeNet network structure (Image source: [Szegedy et al., 2015])

In order to solve the problem of vanishing gradient, GoogLeNet introduces two auxiliary classifiers in the middle layer of the network to strengthen the supervision information.

There are several improved versions of the Inception network, among which the Inception v3 network is the most representative [Szegedy et al., 2016]. The Inception v3 network replaces large convolution kernels with multiple layers of small convolution kernels, reducing computation and parameters while keeping the receptive field unchanged. Specifically: 1) two layers of 3 x 3 convolution replace the 5 x 5 convolution in v1; 2) consecutive K x 1 and 1 x K convolutions replace a K x K convolution. In addition, the Inception v3 network introduced training optimizations such as label smoothing and batch normalization.
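The parameter savings from these two factorizations are easy to verify by counting kernel weights (single input/output channel, bias ignored; the variable names are illustrative):

```python
# Weight counts for the Inception v3 factorizations described above.
p_5x5 = 5 * 5              # one 5 x 5 kernel
p_two_3x3 = 2 * (3 * 3)    # two stacked 3 x 3 kernels, same 5 x 5 receptive field
K = 7
p_KxK = K * K              # one K x K kernel
p_factored = K + K         # a K x 1 kernel followed by a 1 x K kernel
print(p_5x5, p_two_3x3)    # 25 18
print(p_KxK, p_factored)   # 49 14
```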

Residual Network (ResNet) improves the efficiency of information dissemination by adding a shortcut connection (also called Residual Connection) to the nonlinear convolutional layer.

Suppose that in a deep network we expect a nonlinear unit (which can be one or more convolutional layers) $f(x; \theta)$ to approximate a target function $h(x)$. The target function can be split into two parts: an identity function (Identity Function) $x$ and a residual function (Residual Function) $h(x) - x$:

$h(x)=\underbrace{x}_{\text{identity function}}+\underbrace{(h(x)-x)}_{\text{residual function}}$

According to the general approximation theorem, a nonlinear unit composed of a neural network has enough ability to approximate the original objective function or residual function, but in practice the latter is easier to learn [He et al., 2016].

Therefore, the original optimization problem can be transformed: let the nonlinear unit $f(x; \theta)$ approximate the residual function $h(x) - x$, and use $f(x; \theta) + x$ to approximate $h(x)$.
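A minimal sketch of this idea, assuming for illustration that $f(x; \theta)$ is a single linear map rather than ResNet's convolutional stack:

```python
import numpy as np

# A residual unit outputs f(x; theta) + x: the learnable branch only has to
# fit the residual h(x) - x. Here f is a stand-in linear map W @ x.
def residual_unit(x, W):
    f = W @ x        # stand-in for the nonlinear unit f(x; theta)
    return f + x     # shortcut connection adds the identity back

x = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3))           # zero residual branch
print(residual_unit(x, W))     # [1. 2. 3.]
```

With the residual branch at zero, the unit reduces exactly to the identity, which is one intuition for why the residual function is easier to learn than the full mapping.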

Figure 15 shows a typical residual unit, which is composed of multiple cascaded (equal-width) convolutional layers and a shortcut connection across them, with a ReLU activation applied to their sum to produce the output.

The residual network is a very deep network formed by connecting many residual units in series. Similar to the residual network is Highway Network [Srivastava et al., 2015].

Figure 15 — A Simple Residual Unit Structure.

Section 1.3 introduced some variants of convolution; different convolution operations can be obtained by varying the stride and zero padding. This section introduces some other convolution methods.

We can generally achieve the conversion of high-dimensional features to low-dimensional features through convolution operations. For example, in one-dimensional convolution, a 5-dimensional input feature passes through a convolution kernel of size 3, and its output is a 3-dimensional feature. If the step size is greater than 1, the dimensionality of the output feature can be further reduced. But in some tasks, we need to map low-dimensional features to high-dimensional features, and we still hope to achieve it through convolution.

Suppose there is a high-dimensional vector $x \in \mathbb{R}^d$ and a low-dimensional vector $z \in \mathbb{R}^p$, with $p < d$. If an affine transformation is used to map from high dimension to low dimension,

$z = Wx$

where $W \in \mathbb{R}^{p \times d}$ is the transformation matrix. By transposing $W$, we can easily realize the reverse mapping from low dimension to high dimension, namely

$x = W^T z$

It should be noted that formula (42) and formula (43) are not inverses of each other; the two mappings are only formally a transposition relationship.

In a fully connected network (ignoring the activation function), forward computation and backpropagation are also a transposition relationship. In the forward computation, the net input of layer $l+1$ is $z^{(l+1)} = W^{(l+1)} z^{(l)}$; in backpropagation, the error term of layer $l$ is $\delta^{(l)} = (W^{(l+1)})^T \delta^{(l+1)}$.
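This forward/backward transposition can be checked numerically. A small sketch for a single linear layer with an (assumed) linear loss, compared against a finite-difference gradient:

```python
import numpy as np

# For z = W x with scalar loss L(z) = c . z, the gradient w.r.t. x is W^T c:
# the backward pass multiplies the error by the transpose of the forward map.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
c = rng.normal(size=3)            # delta, the error term at the output

grad_analytic = W.T @ c           # backpropagated gradient

# Central finite differences as an independent check.
eps = 1e-6
grad_numeric = np.zeros(5)
for i in range(5):
    e = np.zeros(5); e[i] = eps
    grad_numeric[i] = (c @ (W @ (x + e)) - c @ (W @ (x - e))) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```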

The convolution operation can also be written as an affine transformation. Suppose a 5-dimensional vector $x$ is convolved with a convolution kernel of size 3, $w = [w_1, w_2, w_3]^T$, to obtain a 3-dimensional vector $z$. The convolution operation can be written as

$z = w \otimes x = \begin{bmatrix} w_1 & w_2 & w_3 & 0 & 0 \\ 0 & w_1 & w_2 & w_3 & 0 \\ 0 & 0 & w_1 & w_2 & w_3 \end{bmatrix} x = Cx$

Where C is a sparse matrix whose non-zero elements come from the elements in the convolution kernel w.

To realize the mapping from the 3-dimensional vector z back to the 5-dimensional vector x, we can transpose the affine matrix, that is,

$x = C^T z = \begin{bmatrix} w_1 & 0 & 0 \\ w_2 & w_1 & 0 \\ w_3 & w_2 & w_1 \\ 0 & w_3 & w_2 \\ 0 & 0 & w_3 \end{bmatrix} z = \mathrm{rot180}(w) \,\widetilde{\otimes}\, z$

Among them, rot180(·) denotes rotating the kernel by 180 degrees. It can be seen from formula (45) and formula (48) that, from the perspective of affine transformation, the two convolution operations $z = w \otimes x$ and $x = \mathrm{rot180}(w) \,\widetilde{\otimes}\, z$ are also formally a transposition relationship. Therefore, the convolution operation that maps low-dimensional features to high-dimensional features is called Transposed Convolution [Dumoulin et al., 2016], also known as Deconvolution [Zeiler et al., 2011].
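The affine view above can be verified numerically in 1-D: C x reproduces the convolution, and C^T z equals the wide convolution of z with the 180-degree-rotated kernel (a sketch with arbitrary example values):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.0, -1.0, 2.0, 1.0])

# The sparse matrix C from the derivation above (kernel size 3, input size 5).
C = np.array([
    [w[0], w[1], w[2], 0,    0   ],
    [0,    w[0], w[1], w[2], 0   ],
    [0,    0,    w[0], w[1], w[2]],
])

z = C @ x
assert np.allclose(z, np.correlate(x, w, mode='valid'))   # z = w (x) x

# Transposed convolution: pad z with K - 1 zeros on both ends and convolve
# with rot180(w), which in 1-D is simply the reversed kernel.
K = len(w)
z_pad = np.pad(z, K - 1)
x_back = np.correlate(z_pad, w[::-1], mode='valid')
print(np.allclose(x_back, C.T @ z))   # True
```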

In the convolutional network, the forward calculation and back propagation of the convolutional layer are also a kind of transposition relationship.

For an M-dimensional vector z and a convolution kernel of size K, if we want to map it to a higher-dimensional vector through convolution, we only need to pad both ends of z with P = K − 1 zeros and then convolve, obtaining an (M + K − 1)-dimensional vector.

Transposed convolution also applies to two-dimensional convolution. Figure 16 shows a two-dimensional convolution with stride S = 1 and no zero padding (P = 0), together with its corresponding transposed convolution.

Figure 16 — 2D convolution with stride S = 1 and no zero padding P = 0, and its corresponding transposed convolution.

Micro-step convolution. We can downsample the input features by using a convolution stride S > 1, which greatly reduces the feature dimension. Similarly, we can upsample by using a transposed-convolution stride S < 1, which greatly increases the feature dimension. Transposed convolution with stride S < 1 is also called Fractionally-Strided Convolution [Long et al., 2015]. To achieve micro-step convolution, we can insert zeros between the input features, which indirectly makes the stride smaller.

If the stride of the convolution operation is S > 1 and we want the corresponding transposed convolution to have stride $\frac{1}{S}$, we need to insert S − 1 zeros between the input features to slow down the movement of the kernel.

Take one-dimensional transposed convolution as an example: for an M-dimensional vector z and a convolution kernel of size K, pad both ends of z with P = K − 1 zeros, insert D zeros between every two elements of z, and then perform convolution with stride 1, obtaining a ((D + 1) × (M − 1) + K)-dimensional vector.
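The zero-insertion recipe above can be sketched directly (the function name `fractional_stride_conv` is illustrative):

```python
import numpy as np

# Fractionally-strided (micro-step) convolution in 1-D: insert D zeros
# between the M elements of z, pad both ends with K - 1 zeros, then
# convolve with stride 1. Output length: (D + 1) * (M - 1) + K.
def fractional_stride_conv(z, w, D):
    M, K = len(z), len(w)
    expanded = np.zeros((D + 1) * (M - 1) + 1)
    expanded[:: D + 1] = z               # D zeros between elements
    padded = np.pad(expanded, K - 1)     # zero padding P = K - 1
    return np.correlate(padded, w, mode='valid')

z = np.array([1.0, 2.0, 3.0])            # M = 3
w = np.array([1.0, 1.0, 1.0])            # K = 3
out = fractional_stride_conv(z, w, D=1)
print(len(out))                          # (1 + 1) * (3 - 1) + 3 = 7
```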

Figure 17 shows a two-dimensional convolution with stride S = 2 and no zero padding (P = 0), together with its corresponding transposed convolution.

Figure 17 — 2D convolution with stride S = 2 and no zero padding P = 0, and its corresponding transposed convolution.

For a convolutional layer, if we want to increase the receptive field of the output units, there are generally three ways: 1) increase the size of the convolution kernel; 2) increase the number of layers; for example, two layers of 3 x 3 convolution approximate the effect of one layer of 5 x 5 convolution; 3) perform a pooling operation before the convolution. The first two methods increase the number of parameters, while the third loses some information.

Atrous Convolution is a method of increasing the receptive field of the output units without increasing the number of parameters. It is also called Dilated Convolution [Chen et al., 2018; Yu et al., 2015].

Atrous convolution effectively enlarges the kernel by inserting "holes" (zeros) into it. If D − 1 holes are inserted between every two elements of the convolution kernel, the effective size of the kernel becomes

$K' = K + (K-1) \times (D-1)$

where D is called the Dilation Rate. When D = 1, the kernel is an ordinary convolution kernel.
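The effective-size formula can be checked by materializing the dilated kernel (the helper name `dilate_kernel` is illustrative):

```python
import numpy as np

# Materialize a dilated 1-D kernel by inserting D - 1 zeros ("holes")
# between the original elements: effective size K' = K + (K - 1) * (D - 1).
def dilate_kernel(w, D):
    K = len(w)
    out = np.zeros(K + (K - 1) * (D - 1))
    out[::D] = w
    return out

w = np.array([1.0, 2.0, 3.0])     # K = 3
print(len(dilate_kernel(w, 1)))   # 3  (D = 1: ordinary convolution)
print(len(dilate_kernel(w, 2)))   # 5  (K' = 3 + 2 * 1)
print(len(dilate_kernel(w, 3)))   # 7  (K' = 3 + 2 * 2)
```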

Figure 18 shows an example of hole convolution.

Figure 18 — Atrous Convolution

Convolutional neural networks are inspired by the biological receptive field mechanism. In 1959, [Hubel et al., 1959] discovered two types of cells in the primary visual cortex of cats: simple cells and complex cells, which undertake different levels of visual perception [Hubel et al., 1962]. The receptive field of a simple cell is long and narrow, and each simple cell is only sensitive to a light bar of a certain orientation in its receptive field, while a complex cell is sensitive to a light bar of a certain orientation moving in a specific direction within its receptive field. Inspired by this, Kunihiko Fukushima proposed a multi-layer neural network with convolution and sub-sampling operations: the Neocognitron [Fukushima, 1980]. Since there was no backpropagation algorithm at the time, the Neocognitron was trained with unsupervised learning. [LeCun et al., 1989] introduced the backpropagation algorithm into convolutional neural networks and achieved great success in handwritten digit recognition [LeCun et al., 1998].

AlexNet [Krizhevsky et al., 2012] is the first modern deep convolutional network model and can be said to mark the real breakthrough of deep learning in image classification. AlexNet requires no pre-training or layer-by-layer training, and it was the first to use many modern deep network techniques, such as GPU parallel training, ReLU as the nonlinear activation function, Dropout to prevent overfitting, and data augmentation to improve model accuracy. These techniques greatly promoted the development of end-to-end deep learning models.

After AlexNet, many excellent convolutional networks appeared, such as the VGG network [Simonyan et al., 2014], the Inception v1, v3, and v4 networks [Szegedy et al., 2015, 2016, 2017], the residual network [He et al., 2016], and so on.

At present, convolutional neural networks have become the mainstream model in the field of computer vision. By introducing shortcut connections across layers, convolutional networks with hundreds or even thousands of layers can be trained. As the number of layers increases, convolutional layers increasingly use small 1 x 1 and 3 x 3 kernels, and some irregular convolution operations have also appeared, such as atrous convolution [Chen et al., 2018; Yu et al., 2015] and deformable convolution [Dai et al., 2017]. The network structure also gradually tends toward the Fully Convolutional Network (FCN) [Long et al., 2015], reducing the role of pooling layers and fully connected layers.

For visualization examples of various convolution operations, please refer to [Dumoulin et al., 2016].
