Notes on ConvNet

These are my study notes on the derivations for convolutional neural networks. If you have any questions, please feel free to contact me at zhongzisha@outlook.com. Any comments are greatly appreciated!

Activation functions

sigmoid function

\begin{aligned} f\left(z\right) & = & \frac{1}{1+\exp\left(-z\right)}\end{aligned}

\begin{aligned} \frac{\partial f}{\partial z} & = & \frac{\exp\left(-z\right)}{\left(1+\exp\left(-z\right)\right)\^{2}}=f\left(z\right)\left(1-f\left(z\right)\right)\end{aligned}
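As a quick sanity check of the identity f'(z) = f(z)(1 - f(z)), the sketch below compares it with a central finite difference. The helper names (`sigmoid_grad`, `numeric_grad`) and the step size `eps` are illustrative choices, not part of the notes.

```python
import math

def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Closed-form derivative f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, eps=1e-6):
    """Central finite difference, used only as a reference value."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The two columns should agree to many decimal places; at z = 0 the derivative reaches its maximum value of 0.25.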

hyperbolic tangent function

\begin{aligned} f\left(z\right) & = & A\tanh\left(Bz\right)=1.7159\tanh\left(0.6666z\right)=A\frac{\exp\left(2Bz\right)-1}{\exp\left(2Bz\right)+1}\end{aligned}

\begin{aligned} \frac{\partial f}{\partial z} & = & A\frac{2B\exp\left(2Bz\right)\left(\exp\left(2Bz\right)+1\right)-\left(\exp\left(2Bz\right)-1\right)2B\exp\left(2Bz\right)}{\left(\exp\left(2Bz\right)+1\right)\^{2}}=2AB\frac{2\exp\left(2Bz\right)}{\left(\exp\left(2Bz\right)+1\right)\^{2}}=\frac{B}{A}\left(A\^{2}-f\left(z\right)\^{2}\right)\end{aligned}
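The derivative of the scaled tanh can be written in terms of the output alone: since 1 - tanh(Bz)^2 = 4 exp(2Bz) / (exp(2Bz) + 1)^2, we get f'(z) = AB (1 - tanh(Bz)^2) = (B / A)(A^2 - f(z)^2). The sketch below checks this form numerically; the function names and the finite-difference step are assumptions made for illustration.

```python
import math

A, B = 1.7159, 0.6666  # constants used in the notes

def scaled_tanh(z):
    """f(z) = A * tanh(B * z)"""
    return A * math.tanh(B * z)

def scaled_tanh_grad(z):
    """f'(z) expressed through the output: (B / A) * (A^2 - f(z)^2)."""
    f = scaled_tanh(z)
    return (B / A) * (A * A - f * f)

def numeric_grad(f, z, eps=1e-6):
    """Central finite difference, used only as a reference value."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-1.5, 0.0, 0.3, 2.0):
    print(z, scaled_tanh_grad(z), numeric_grad(scaled_tanh, z))
```

At z = 0 the output is 0, so the derivative reduces to A * B.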

ReLU function

\begin{aligned} f\left(z\right) & = & \max\left(0,z\right)\end{aligned}

\begin{aligned} \frac{\partial f}{\partial z} & = & \begin{cases} 0 & z<0\\ 1 & z>0 \end{cases}\end{aligned}

Leaky ReLU

\begin{aligned} f\left(z\right) & = & \begin{cases} z & z\ge0\\ \frac{z}{a} & z<0 \end{cases}\end{aligned}

where a\in\left(1,+\infty\right)

Parametric ReLU

The a in Leaky ReLU is learned during training via backpropagation.

Randomized Leaky ReLU

\begin{aligned} f\left(z\right) & = & \begin{cases} z & z\ge0\\ \frac{z}{a} & z<0 \end{cases}\end{aligned}

where a\sim U\left(l,u\right) is sampled from a uniform distribution.
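The ReLU variants above differ only in how the negative half-line is scaled. A minimal sketch follows; the default values `a=100.0`, `l=3.0` and `u=8.0` are illustrative assumptions, not values given in these notes.

```python
import random

def relu(z):
    """max(0, z)"""
    return max(0.0, z)

def leaky_relu(z, a=100.0):
    """z for z >= 0, z / a otherwise, with a fixed a > 1 (default is illustrative)."""
    return z if z >= 0 else z / a

def randomized_leaky_relu(z, l=3.0, u=8.0, rng=random):
    """Same shape as Leaky ReLU, but a is drawn from U(l, u) on every call."""
    a = rng.uniform(l, u)
    return z if z >= 0 else z / a

print(relu(-1.0), relu(2.0))
print(leaky_relu(-2.0, a=4.0))
print(randomized_leaky_relu(-8.0, l=2.0, u=4.0))
```

Parametric ReLU has the same forward pass as `leaky_relu`; the only difference is that `a` is treated as a trainable parameter.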

Neural Net

We have a set of training samples \left\{ \left(x_{i},y_{i}\right)\right\} _{i=1}\^{N}. A layer of neurons is defined as

\begin{aligned} z\^{\left(l+1\right)} & = & W\^{\left(l\right)T}x\^{\left(l\right)}+b\^{\left(l\right)}\end{aligned}

x\^{\left(l+1\right)}=a\^{\left(l+1\right)}=f\left(z\^{\left(l+1\right)}\right)=\frac{1}{1+e\^{-z\^{\left(l+1\right)}}}

where x\^{\left(l\right)}\in R\^{n_{l}\times1}, z\^{\left(l+1\right)}\in R\^{n_{l+1}\times1}, b\^{\left(l\right)}\in R\^{n_{l+1}\times1}, W\^{\left(l\right)}\in R\^{n_{l}\times n_{l+1}}.

then, we have

\begin{aligned} \frac{\partial z\^{\left(l+1\right)}}{\partial x\^{\left(l\right)}} & =\frac{\partial z\^{\left(l+1\right)}}{\partial a\^{\left(l\right)}}= & W\^{\left(l\right)T},\quad\frac{\partial z\^{\left(l+1\right)}}{\partial W\^{\left(l\right)}}=x\^{\left(l\right)},\quad\frac{\partial z\^{\left(l+1\right)}}{\partial b\^{\left(l\right)}}=I_{n_{l+1}\times n_{l+1}}\end{aligned}

\frac{\partial a\^{\left(l+1\right)}}{\partial z\^{\left(l+1\right)}}=\frac{\partial f\left(z\^{\left(l+1\right)}\right)}{\partial z\^{\left(l+1\right)}}=f'\left(z\^{\left(l+1\right)}\right)\in R\^{n_{l+1}\times1}

Suppose we have an l_{max}-layer network. The loss function is defined as

\begin{aligned} L & = & \frac{1}{2}\left|\left|y-a\^{\left(l_{max}\right)}\right|\right|_{2}\^{2}\end{aligned}

then,

\begin{aligned} \delta\^{\left(a\^{\left(l\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(l\right)}}=\begin{cases} -\left(y-a\^{\left(l_{max}\right)}\right) & l=l_{max}\\ \frac{\partial L}{\partial z\^{\left(l+1\right)}}\frac{\partial z\^{\left(l+1\right)}}{\partial a\^{\left(l\right)}}=W\^{\left(l\right)}\delta\^{\left(z\^{\left(l+1\right)}\right)} & \text{otherwise} \end{cases}\end{aligned}

\begin{aligned} \delta\^{\left(z\^{\left(l\right)}\right)} & = & \frac{\partial L}{\partial z\^{\left(l\right)}}=\frac{\partial L}{\partial a\^{\left(l\right)}}\frac{\partial a\^{\left(l\right)}}{\partial z\^{\left(l\right)}}=\delta\^{\left(a\^{\left(l\right)}\right)}\circ f'\left(z\^{\left(l\right)}\right)\in R\^{n_{l}\times1}\end{aligned}

then

\begin{aligned} \nabla_{W\^{\left(l\right)}}L & = & \frac{\partial L}{\partial W\^{\left(l\right)}}=\frac{\partial L}{\partial z\^{\left(l+1\right)}}\frac{\partial z\^{\left(l+1\right)}}{\partial W\^{\left(l\right)}}=a\^{\left(l\right)}\delta\^{\left(z\^{\left(l+1\right)}\right)T}\in R\^{n_{l}\times n_{l+1}}\end{aligned}

\begin{aligned} \nabla_{b\^{\left(l\right)}}L & = & \frac{\partial L}{\partial b\^{\left(l\right)}}=\frac{\partial L}{\partial z\^{\left(l+1\right)}}\frac{\partial z\^{\left(l+1\right)}}{\partial b\^{\left(l\right)}}=\delta\^{\left(z\^{\left(l+1\right)}\right)}\in R\^{n_{l+1}\times1}\end{aligned}

A Neural Net Case

Suppose we have

\begin{aligned} a\^{\left(1\right)} & = & x\in R\^{n_{1}\times1}\end{aligned}

z\^{\left(2\right)}=W\^{\left(1\right)T}a\^{\left(1\right)}+b\^{\left(1\right)},\quad W\^{\left(1\right)}\in R\^{n_{1}\times n_{2}},b\^{\left(1\right)}\in R\^{n_{2}\times1}

a\^{\left(2\right)}=f\left(z\^{\left(2\right)}\right)\in R\^{n_{2}\times1}

z\^{\left(3\right)}=W\^{\left(2\right)T}a\^{\left(2\right)}+b\^{\left(2\right)},\quad W\^{\left(2\right)}\in R\^{n_{2}\times n_{3}},b\^{\left(2\right)}\in R\^{n_{3}\times1}

a\^{\left(3\right)}=f\left(z\^{\left(3\right)}\right)\in R\^{n_{3}\times1}

z\^{\left(4\right)}=W\^{\left(3\right)T}a\^{\left(3\right)}+b\^{\left(3\right)},\quad W\^{\left(3\right)}\in R\^{n_{3}\times n_{4}},b\^{\left(3\right)}\in R\^{n_{4}\times1}

a\^{\left(4\right)}=f\left(z\^{\left(4\right)}\right)\in R\^{n_{4}\times1}

For Euclidean loss:

\begin{aligned} L & = & \frac{1}{2}\left|\left|y-a\^{\left(4\right)}\right|\right|_{2}\^{2},\quad y\in R\^{n_{4}\times1}\end{aligned}

For cross-entropy loss:

\begin{aligned} a_{i}\^{\left(4\right)} & = & \frac{\exp\left(z_{i}\^{\left(4\right)}\right)}{\sum_{j=1}\^{n_{4}}\exp\left(z_{j}\^{\left(4\right)}\right)}\in R\end{aligned}

\begin{aligned} \frac{\partial a_{i}\^{\left(4\right)}}{\partial z_{i}\^{\left(4\right)}} & = & \frac{\exp\left(z_{i}\^{\left(4\right)}\right)\sum_{j=1}\^{n_{4}}\exp\left(z_{j}\^{\left(4\right)}\right)-\exp\left(z_{i}\^{\left(4\right)}\right)\exp\left(z_{i}\^{\left(4\right)}\right)}{\left\{ \sum_{j=1}\^{n_{4}}\exp\left(z_{j}\^{\left(4\right)}\right)\right\} \^{2}}=a_{i}\^{\left(4\right)}-a_{i}\^{\left(4\right)}a_{i}\^{\left(4\right)}=a_{i}\^{\left(4\right)}\left(1-a_{i}\^{\left(4\right)}\right)\end{aligned}

\begin{aligned} \frac{\partial a_{i}\^{\left(4\right)}}{\partial z_{k}\^{\left(4\right)}} & = & \frac{-\exp\left(z_{i}\^{\left(4\right)}\right)\exp\left[\left(z_{k}\^{\left(4\right)}\right)\right]}{\left\{ \sum_{j=1}\^{n_{4}}\exp\left(z_{j}\^{\left(4\right)}\right)\right\} \^{2}}=-a_{i}\^{\left(4\right)}a_{k}\^{\left(4\right)},\quad i\neq k\end{aligned}

\begin{aligned} L & = & -\sum_{i=1}\^{C}\log\left\{ \left(a_{i}\^{\left(4\right)}\right)\^{y_{i}}\right\} =-\sum_{i=1}\^{C}y_{i}\log\left(a_{i}\^{\left(4\right)}\right)=-\sum_{j\neq i}\^{C}y_{j}\log\left(a_{j}\^{\left(4\right)}\right)-y_{i}\log\left(a_{i}\^{\left(4\right)}\right)\end{aligned}

\begin{aligned} \frac{\partial L}{\partial a_{i}\^{\left(4\right)}} & = & -y_{i}\frac{1}{a_{i}\^{\left(4\right)}}\end{aligned}

\begin{aligned} \frac{\partial L}{\partial z_{i}\^{\left(4\right)}} & = & -\sum_{j\neq i}\^{C}y_{j}\frac{1}{a_{j}\^{\left(4\right)}}\left(-a_{j}\^{\left(4\right)}a_{i}\^{\left(4\right)}\right)-y_{i}\frac{1}{a_{i}\^{\left(4\right)}}a_{i}\^{\left(4\right)}\left(1-a_{i}\^{\left(4\right)}\right)=\sum_{j\neq i}\^{C}y_{j}a_{i}\^{\left(4\right)}-y_{i}\left(1-a_{i}\^{\left(4\right)}\right)=\sum_{j=1}\^{C}y_{j}a_{i}\^{\left(4\right)}-y_{i}=a_{i}\^{\left(4\right)}-y_{i}\end{aligned}
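The result that the gradient of the softmax cross-entropy loss with respect to the logits is a - y (using that the targets sum to 1) is easy to verify numerically. The sketch below is one way to do so; subtracting the maximum logit inside `softmax` is a standard stability trick added here and is not part of the derivation, and the sample `z` and `y` are made up.

```python
import math

def softmax(z):
    m = max(z)                      # stability shift; does not change the result
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(z, y):
    """L = -sum_i y_i * log(a_i) with a = softmax(z)."""
    a = softmax(z)
    return -sum(yi * math.log(ai) for yi, ai in zip(y, a))

def grad_analytic(z, y):
    """dL/dz_i = a_i - y_i (valid when sum(y) == 1)."""
    return [ai - yi for ai, yi in zip(softmax(z), y)]

def grad_numeric(z, y, eps=1e-6):
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        g.append((cross_entropy(zp, y) - cross_entropy(zm, y)) / (2.0 * eps))
    return g

z = [1.0, 2.0, 0.5]   # hypothetical logits
y = [0.0, 1.0, 0.0]   # one-hot target
ga = grad_analytic(z, y)
gn = grad_numeric(z, y)
print(ga)
print(gn)
```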

Then we can derive the error derivatives and the gradients used in the update rules as follows (assuming Euclidean loss and the sigmoid activation function):

\begin{aligned} \delta\^{\left(a\^{\left(4\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(4\right)}}=-\left(y-a\^{\left(4\right)}\right)\in R\^{n_{4}\times1}\end{aligned}

\delta\^{\left(z\^{\left(4\right)}\right)}=\frac{\partial L}{\partial z\^{\left(4\right)}}=\frac{\partial L}{\partial a\^{\left(4\right)}}\frac{\partial a\^{\left(4\right)}}{\partial z\^{\left(4\right)}}=\delta\^{\left(a\^{\left(4\right)}\right)}\circ a\^{\left(4\right)}\circ\left(1-a\^{\left(4\right)}\right)\in R\^{n_{4}\times1}

\begin{aligned} \nabla_{W\^{\left(3\right)}}L & = & \frac{\partial L}{\partial W\^{\left(3\right)}}=\frac{\partial L}{\partial z\^{\left(4\right)}}\frac{\partial z\^{\left(4\right)}}{\partial W\^{\left(3\right)}}=a\^{\left(3\right)}\delta\^{\left(z\^{\left(4\right)}\right)T}\in R\^{n_{3}\times n_{4}}\end{aligned}

\nabla_{b\^{\left(3\right)}}L=\frac{\partial L}{\partial b\^{\left(3\right)}}=\frac{\partial L}{\partial z\^{\left(4\right)}}\frac{\partial z\^{\left(4\right)}}{\partial b\^{\left(3\right)}}=\delta\^{\left(z\^{\left(4\right)}\right)}\in R\^{n_{4}\times1}

\begin{aligned} \delta\^{\left(a\^{\left(3\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(3\right)}}=\frac{\partial L}{\partial z\^{\left(4\right)}}\frac{\partial z\^{\left(4\right)}}{\partial a\^{\left(3\right)}}=W\^{\left(3\right)}\delta\^{\left(z\^{\left(4\right)}\right)}\in R\^{n_{3}\times1}\end{aligned}

\delta\^{\left(z\^{\left(3\right)}\right)}=\frac{\partial L}{\partial z\^{\left(3\right)}}=\frac{\partial L}{\partial a\^{\left(3\right)}}\frac{\partial a\^{\left(3\right)}}{\partial z\^{\left(3\right)}}=\delta\^{\left(a\^{\left(3\right)}\right)}\circ a\^{\left(3\right)}\circ\left(1-a\^{\left(3\right)}\right)\in R\^{n_{3}\times1}

\begin{aligned} \nabla_{W\^{\left(2\right)}}L & = & \frac{\partial L}{\partial W\^{\left(2\right)}}=\frac{\partial L}{\partial z\^{\left(3\right)}}\frac{\partial z\^{\left(3\right)}}{\partial W\^{\left(2\right)}}=a\^{\left(2\right)}\delta\^{\left(z\^{\left(3\right)}\right)T}\in R\^{n_{2}\times n_{3}}\end{aligned}

\nabla_{b\^{\left(2\right)}}L=\frac{\partial L}{\partial b\^{\left(2\right)}}=\frac{\partial L}{\partial z\^{\left(3\right)}}\frac{\partial z\^{\left(3\right)}}{\partial b\^{\left(2\right)}}=\delta\^{\left(z\^{\left(3\right)}\right)}\in R\^{n_{3}\times1}

\begin{aligned} \delta\^{\left(a\^{\left(2\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(2\right)}}=\frac{\partial L}{\partial z\^{\left(3\right)}}\frac{\partial z\^{\left(3\right)}}{\partial a\^{\left(2\right)}}=W\^{\left(2\right)}\delta\^{\left(z\^{\left(3\right)}\right)}\in R\^{n_{2}\times1}\end{aligned}

\delta\^{\left(z\^{\left(2\right)}\right)}=\frac{\partial L}{\partial z\^{\left(2\right)}}=\frac{\partial L}{\partial a\^{\left(2\right)}}\frac{\partial a\^{\left(2\right)}}{\partial z\^{\left(2\right)}}=\delta\^{\left(a\^{\left(2\right)}\right)}\circ a\^{\left(2\right)}\circ\left(1-a\^{\left(2\right)}\right)\in R\^{n_{2}\times1}

\begin{aligned} \nabla_{W\^{\left(1\right)}}L & = & \frac{\partial L}{\partial W\^{\left(1\right)}}=\frac{\partial L}{\partial z\^{\left(2\right)}}\frac{\partial z\^{\left(2\right)}}{\partial W\^{\left(1\right)}}=a\^{\left(1\right)}\delta\^{\left(z\^{\left(2\right)}\right)T}\in R\^{n_{1}\times n_{2}}\end{aligned}

\nabla_{b\^{\left(1\right)}}L=\frac{\partial L}{\partial b\^{\left(1\right)}}=\frac{\partial L}{\partial z\^{\left(2\right)}}\frac{\partial z\^{\left(2\right)}}{\partial b\^{\left(1\right)}}=\delta\^{\left(z\^{\left(2\right)}\right)}\in R\^{n_{2}\times1}

\begin{aligned} \delta\^{\left(a\^{\left(1\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(1\right)}}=\frac{\partial L}{\partial z\^{\left(2\right)}}\frac{\partial z\^{\left(2\right)}}{\partial a\^{\left(1\right)}}=W\^{\left(1\right)}\delta\^{\left(z\^{\left(2\right)}\right)}\in R\^{n_{1}\times1}\end{aligned}

\delta\^{\left(z\^{\left(1\right)}\right)}=\frac{\partial L}{\partial z\^{\left(1\right)}}=\frac{\partial L}{\partial a\^{\left(1\right)}}\frac{\partial a\^{\left(1\right)}}{\partial z\^{\left(1\right)}}=\delta\^{\left(a\^{\left(1\right)}\right)}\circ a\^{\left(1\right)}\circ\left(1-a\^{\left(1\right)}\right)\in R\^{n_{1}\times1}
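The backward recursion for this fully connected case can be sketched in plain Python and checked against finite differences. The layer sizes, the random initialisation, and the sample `x`, `y` below are arbitrary assumptions. Weights are stored with n_l rows and n_{l+1} columns so that z = W^T a + b.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-t)) for t in v]

def affine(W, a, b):
    """z = W^T a + b, with W stored as n_in rows by n_out columns."""
    return [sum(W[i][j] * a[i] for i in range(len(W))) + b[j] for j in range(len(b))]

def forward(Ws, bs, x):
    acts = [x]                                   # acts[0] is a^(1) = x
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(affine(W, acts[-1], b)))
    return acts

def loss(Ws, bs, x, y):
    out = forward(Ws, bs, x)[-1]
    return 0.5 * sum((yi - oi) ** 2 for yi, oi in zip(y, out))

def backward(Ws, bs, x, y):
    acts = forward(Ws, bs, x)
    delta_a = [o - t for o, t in zip(acts[-1], y)]          # -(y - a) at the output
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        a_in, a_out = acts[l], acts[l + 1]
        # delta_z = delta_a o a o (1 - a) for the sigmoid
        delta_z = [d * s * (1.0 - s) for d, s in zip(delta_a, a_out)]
        grads_W.insert(0, [[ai * dz for dz in delta_z] for ai in a_in])  # a delta_z^T
        grads_b.insert(0, delta_z)
        # delta_a of the previous layer = W delta_z
        delta_a = [sum(Ws[l][i][j] * delta_z[j] for j in range(len(delta_z)))
                   for i in range(len(a_in))]
    return grads_W, grads_b

rng = random.Random(0)
sizes = [3, 4, 3, 2]                             # n_1, n_2, n_3, n_4 (illustrative)
Ws = [[[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
      for n_in, n_out in zip(sizes, sizes[1:])]
bs = [[rng.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]
x, y = [0.2, -0.4, 0.7], [1.0, 0.0]

gW, gb = backward(Ws, bs, x, y)

def numeric_dW(l, i, j, eps=1e-6):
    old = Ws[l][i][j]
    Ws[l][i][j] = old + eps
    lp = loss(Ws, bs, x, y)
    Ws[l][i][j] = old - eps
    lm = loss(Ws, bs, x, y)
    Ws[l][i][j] = old
    return (lp - lm) / (2.0 * eps)

max_err = max(abs(gW[l][i][j] - numeric_dW(l, i, j))
              for l in range(len(Ws))
              for i in range(len(Ws[l]))
              for j in range(len(Ws[l][i])))
print(max_err)
```

A small `max_err` indicates that the outer-product form of the weight gradient and the W delta recursion agree with the numerical derivative.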

A CNN Case

I-C1-MP1-FC1-O

Suppose the net structure is I-C1-MP1-FC1-O, then we have

\begin{aligned} a\^{\left(1\right)} & = & x\in R\^{H\times W\times B}\end{aligned}

zc_{j}\^{\left(1\right)}=\sum_{i=1}\^{B}a_{i}\^{\left(1\right)}\star k_{ij}\^{\left(1\right)}+b_{j}\^{\left(1\right)}\in R\^{\left(H-h+1\right)\times\left(W-w+1\right)},\quad j=1,\cdots,F_{1}

ac_{j}\^{\left(1\right)}=f\left(zc_{j}\^{\left(1\right)}\right)\in R\^{\left(H-h+1\right)\times\left(W-w+1\right)}

zp_{j}\^{\left(1\right)}=maxpool(ac_{j}\^{\left(1\right)},poolsize)\in R\^{\left(\frac{H-h+1}{poolsize}\right)\times\left(\frac{W-w+1}{poolsize}\right)}

ap_{j}\^{\left(1\right)}=zp_{j}\^{\left(1\right)}\in R\^{\left(\frac{H-h+1}{poolsize}\right)\times\left(\frac{W-w+1}{poolsize}\right)}

\begin{aligned} a\^{\left(2\right)} & = & reshape(ap_{j}\^{\left(1\right)})\in R\^{n_{2}\times1},\quad n_{2}=\left(\frac{H-h+1}{poolsize}\right)\times\left(\frac{W-w+1}{poolsize}\right)\times F_{1}\end{aligned}

z\^{\left(3\right)}=W\^{\left(2\right)T}a\^{\left(2\right)}+b\^{\left(2\right)},\quad W\^{\left(2\right)}\in R\^{n_{2}\times n_{3}},b\^{\left(2\right)}\in R\^{n_{3}\times1}

a\^{\left(3\right)}=f_{3}\left(z\^{\left(3\right)}\right)\in R\^{n_{3}\times1}
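The size n_2 of the flattened vector follows directly from the 'valid' convolution and pooling shapes. The helper below is a small sketch of that arithmetic; the 28 x 28 input, 5 x 5 kernel, pool size 2 and 6 feature maps are made-up example values, and the divisibility check is an added assumption (the formulas above implicitly require it).

```python
def conv_valid_shape(H, W, h, w):
    """Output size of a 'valid' convolution of an H x W map with an h x w kernel."""
    return H - h + 1, W - w + 1

def pool_shape(H, W, poolsize):
    """Output size of non-overlapping pooling; requires exact divisibility."""
    if H % poolsize or W % poolsize:
        raise ValueError("map size must be divisible by poolsize")
    return H // poolsize, W // poolsize

def flattened_size(H, W, h, w, poolsize, F1):
    """n_2 = (pooled height) * (pooled width) * F1."""
    ch, cw = conv_valid_shape(H, W, h, w)
    ph, pw = pool_shape(ch, cw, poolsize)
    return ph * pw * F1

# 28x28 input, 5x5 kernels -> 24x24 maps -> 12x12 after pooling -> 12 * 12 * 6
print(flattened_size(28, 28, 5, 5, 2, 6))
```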

For Euclidean loss:

\begin{aligned} L & = & \frac{1}{2}\left|\left|y-a\^{\left(3\right)}\right|\right|_{2}\^{2}\in R,\quad y\in R\^{n_{3}\times1}\end{aligned}

For cross-entropy loss:

\begin{aligned} a_{i}\^{\left(3\right)} & = & \frac{\exp\left(z_{i}\^{\left(3\right)}\right)}{\sum_{k}\exp\left(z_{k}\^{\left(3\right)}\right)}\end{aligned}

\begin{aligned} \frac{\partial a_{i}\^{\left(3\right)}}{\partial z_{i}\^{\left(3\right)}} & = & \frac{\exp\left(z_{i}\^{\left(3\right)}\right)\sum_{k}\exp\left(z_{k}\^{\left(3\right)}\right)-\exp\left(z_{i}\^{\left(3\right)}\right)\exp\left(z_{i}\^{\left(3\right)}\right)}{\left(\sum_{k}\exp\left(z_{k}\^{\left(3\right)}\right)\right)\^{2}}=a_{i}\^{\left(3\right)}\left(1-a_{i}\^{\left(3\right)}\right)\end{aligned}

\begin{aligned} \frac{\partial a_{j}\^{\left(3\right)}}{\partial z_{i}\^{\left(3\right)}} & = & \frac{-\exp\left(z_{j}\^{\left(3\right)}\right)\exp\left(z_{i}\^{\left(3\right)}\right)}{\left(\sum_{k}\exp\left(z_{k}\^{\left(3\right)}\right)\right)\^{2}}=-a_{j}\^{\left(3\right)}a_{i}\^{\left(3\right)},\quad i\neq j\end{aligned}

\begin{aligned} L & = & -\sum_{i=1}\^{C}y_{i}\log\left(a_{i}\^{\left(3\right)}\right)=-\sum_{j\neq i}y_{j}\log\left(a_{j}\^{\left(3\right)}\right)-y_{i}\log\left(a_{i}\^{\left(3\right)}\right)\end{aligned}

\begin{aligned} \frac{\partial L}{\partial a_{i}\^{\left(3\right)}} & = & -y_{i}\frac{1}{a_{i}\^{\left(3\right)}}\\ \Rightarrow\delta\^{\left(a\^{\left(3\right)}\right)} & = & -\frac{y}{a\^{\left(3\right)}}\end{aligned}

\begin{aligned} \frac{\partial L}{\partial z_{i}\^{\left(3\right)}} & = & -\sum_{j\neq i}y_{j}\frac{1}{a_{j}\^{\left(3\right)}}\left(-a_{j}\^{\left(3\right)}a_{i}\^{\left(3\right)}\right)-y_{i}\frac{1}{a_{i}\^{\left(3\right)}}a_{i}\^{\left(3\right)}\left(1-a_{i}\^{\left(3\right)}\right)=\sum_{j\neq i}y_{j}a_{i}\^{\left(3\right)}-y_{i}+y_{i}a_{i}\^{\left(3\right)}=a_{i}\^{\left(3\right)}-y_{i}\end{aligned}

\begin{aligned} \Rightarrow\delta\^{\left(z\^{\left(3\right)}\right)}=\frac{\partial L}{\partial z\^{\left(3\right)}} & = & a\^{\left(3\right)}-y\in R\^{n_{3}\times1}\end{aligned}

\begin{aligned} \nabla_{W\^{\left(2\right)}}L & = & \frac{\partial L}{\partial W\^{\left(2\right)}}=\frac{\partial L}{\partial z\^{\left(3\right)}}\frac{\partial z\^{\left(3\right)}}{\partial W\^{\left(2\right)}}=a\^{\left(2\right)}\delta\^{\left(z\^{\left(3\right)}\right)T}\in R\^{n_{2}\times n_{3}}\end{aligned}

\begin{aligned} \nabla_{b\^{\left(2\right)}}L & = & \frac{\partial L}{\partial b\^{\left(2\right)}}=\frac{\partial L}{\partial z\^{\left(3\right)}}\frac{\partial z\^{\left(3\right)}}{\partial b\^{\left(2\right)}}=\delta\^{\left(z\^{\left(3\right)}\right)}\in R\^{n_{3}\times1}\end{aligned}

\begin{aligned} \delta\^{\left(a\^{\left(2\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(2\right)}}=\frac{\partial L}{\partial z\^{\left(3\right)}}\frac{\partial z\^{\left(3\right)}}{\partial a\^{\left(2\right)}}=W\^{\left(2\right)}\delta\^{\left(z\^{\left(3\right)}\right)}\in R\^{n_{2}\times1}\end{aligned}

From \delta\^{\left(a\^{\left(2\right)}\right)}\in R\^{n_{2}\times1}, we can recover each \delta\^{\left(ap_{j}\^{\left(1\right)}\right)}\in R\^{\left(\frac{H-h+1}{poolsize}\right)\times\left(\frac{W-w+1}{poolsize}\right)} by reversing the reshape.

Then, we have

\begin{aligned} \delta\^{\left(zp_{j}\^{\left(1\right)}\right)} & = & \delta\^{\left(ap_{j}\^{\left(1\right)}\right)}\end{aligned}

then, we upsample the error sensitivity and get

\begin{aligned} \delta\^{\left(ac_{j}\^{\left(1\right)}\right)} & = & up\left(\delta\^{\left(zp_{j}\^{\left(1\right)}\right)}\right)\in R\^{\left(H-h+1\right)\times\left(W-w+1\right)}\end{aligned}
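The notes do not spell out the up(.) operator. For max pooling, one common choice (assumed here) is to route each pooled sensitivity back to the position that attained the maximum in its window and to put zeros elsewhere. The sketch below records the argmax positions in the forward pass and uses them in `up`; it assumes the map size is divisible by the pool size, and the input values are made up.

```python
def maxpool_forward(a, p):
    """Non-overlapping p x p max pooling; also returns the argmax positions."""
    H, W = len(a), len(a[0])
    out, argmax = [], []
    for r in range(0, H, p):
        out_row, arg_row = [], []
        for c in range(0, W, p):
            window = [(a[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(p) for dc in range(p)]
            value, pos = max(window, key=lambda t: t[0])
            out_row.append(value)
            arg_row.append(pos)
        out.append(out_row)
        argmax.append(arg_row)
    return out, argmax

def up(delta_pooled, argmax, H, W):
    """Scatter each pooled sensitivity to the location of its window's maximum."""
    delta = [[0.0] * W for _ in range(H)]
    for r, row in enumerate(delta_pooled):
        for c, d in enumerate(row):
            i, j = argmax[r][c]
            delta[i][j] += d
    return delta

a = [[1.0, 3.0, 2.0, 0.0],
     [4.0, 2.0, 1.0, 5.0],
     [0.0, 1.0, 7.0, 2.0],
     [6.0, 2.0, 3.0, 1.0]]
pooled, argmax = maxpool_forward(a, 2)
delta = up([[10.0, 20.0], [30.0, 40.0]], argmax, 4, 4)
print(pooled)
for row in delta:
    print(row)
```

With mean pooling instead, the analogous step would spread each sensitivity uniformly over its window.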

then, we get

\begin{aligned} \delta\^{\left(zc_{j}\^{\left(1\right)}\right)} & = & \frac{\partial L}{\partial zc_{j}\^{\left(1\right)}}=\frac{\partial L}{\partial ac_{j}\^{\left(1\right)}}\frac{\partial ac_{j}\^{\left(1\right)}}{\partial zc_{j}\^{\left(1\right)}}=\delta\^{\left(ac_{j}\^{\left(1\right)}\right)}\circ ac_{j}\^{\left(1\right)}\circ\left(1-ac_{j}\^{\left(1\right)}\right)\in R\^{\left(H-h+1\right)\times\left(W-w+1\right)}\end{aligned}

then, we get the following gradients

\begin{aligned} \nabla_{k_{ij}\^{\left(1\right)}}L & = & \frac{\partial L}{\partial zc_{j}\^{\left(1\right)}}\frac{\partial zc_{j}\^{\left(1\right)}}{\partial k_{ij}\^{\left(1\right)}}=rot180\left(conv2\left(a_{i}\^{\left(1\right)},rot180\left(\delta\^{\left(zc_{j}\^{\left(1\right)}\right)}\right),'valid'\right)\right)\in R\^{h\times w}\end{aligned}

\begin{aligned} \nabla_{b_{j}\^{\left(1\right)}}L & = & \frac{\partial L}{\partial zc_{j}\^{\left(1\right)}}\frac{\partial zc_{j}\^{\left(1\right)}}{\partial b_{j}\^{\left(1\right)}}=\sum_{u,v}\left(\delta\^{\left(zc_{j}\^{\left(1\right)}\right)}\right)_{u,v}\in R\^{1}\end{aligned}
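The kernel and bias gradients can be checked numerically. The sketch below assumes that the star operator denotes a true 2-D convolution restricted to the 'valid' region (as MATLAB's conv2 does), with a single input map, and that the sensitivity passed in is the one with respect to the pre-activation zc. All sizes and random values are arbitrary, and `conv2_valid` is a hand-written helper, not a library call.

```python
import random

def rot180(m):
    return [row[::-1] for row in m[::-1]]

def conv2_valid(a, k):
    """True 2-D convolution (kernel flipped), 'valid' region only."""
    H, W, h, w = len(a), len(a[0]), len(k), len(k[0])
    kf = rot180(k)
    return [[sum(a[u + p][v + q] * kf[p][q] for p in range(h) for q in range(w))
             for v in range(W - w + 1)] for u in range(H - h + 1)]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(1)
H, W, h, w = 5, 6, 3, 2
a = rand_matrix(rng, H, W)                       # input map a_i
k = rand_matrix(rng, h, w)                       # kernel k_ij
delta = rand_matrix(rng, H - h + 1, W - w + 1)   # plays the role of dL/dzc_j

def loss(kernel, bias=0.0):
    """A loss that is linear in zc, chosen so that dL/dzc equals delta."""
    z = conv2_valid(a, kernel)
    return sum(d * (zv + bias)
               for drow, zrow in zip(delta, z) for d, zv in zip(drow, zrow))

grad_k = rot180(conv2_valid(a, rot180(delta)))   # formula from the notes
grad_b = sum(sum(row) for row in delta)          # sum over all positions

def numeric_grad_k(p, q, eps=1e-5):
    kp = [row[:] for row in k]
    km = [row[:] for row in k]
    kp[p][q] += eps
    km[p][q] -= eps
    return (loss(kp) - loss(km)) / (2.0 * eps)

max_err_k = max(abs(grad_k[p][q] - numeric_grad_k(p, q))
                for p in range(h) for q in range(w))
err_b = abs(grad_b - (loss(k, 1e-5) - loss(k, -1e-5)) / 2e-5)
print(max_err_k, err_b)
```

If the forward pass used cross-correlation instead of true convolution, the outer `rot180` would not be needed; the check above only covers the convolution convention.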

I-C1-MP1-C2-MP2-FC1-O

Suppose the net structure is I-C1-MP1-C2-MP2-FC1-O. For I1-C1-MP1, we have j=1,\cdots,F_{1} where F_{1} is the number of convolution feature maps, and

\begin{aligned} a\^{\left(1\right)} & = & x\in R\^{H_{1}\times W_{1}\times B_{1}}\end{aligned}

\begin{aligned} zc_{j}\^{\left(1\right)} & = & \sum_{i=1}\^{B_{1}}a_{i}\^{\left(1\right)}\star k_{ij}\^{\left(1\right)}+b_{j}\^{\left(1\right)}\in R\^{\left(H_{1}-h_{1}+1\right)\times\left(W_{1}-w_{1}+1\right)}\end{aligned}

\begin{aligned} ac_{j}\^{\left(1\right)} & = & f\left(zc_{j}\^{\left(1\right)}\right)\in R\^{\left(H_{1}-h_{1}+1\right)\times\left(W_{1}-w_{1}+1\right)}\end{aligned}

\begin{aligned} zp_{j}\^{\left(1\right)} & = & maxpool\left(ac_{j}\^{\left(1\right)}\right)\in R\^{\left(\frac{H_{1}-h_{1}+1}{poolsize_{1}}\right)\times\left(\frac{W_{1}-w_{1}+1}{poolsize_{1}}\right)}\end{aligned}

\begin{aligned} ap_{j}\^{\left(1\right)} & = & zp_{j}\^{\left(1\right)}\in R\^{\left(\frac{H_{1}-h_{1}+1}{poolsize_{1}}\right)\times\left(\frac{W_{1}-w_{1}+1}{poolsize_{1}}\right)}\end{aligned}

For I2-C2-MP2, we have H_{2}=\frac{H_{1}-h_{1}+1}{poolsize_{1}}, W_{2}=\frac{W_{1}-w_{1}+1}{poolsize_{1}}, B_{2}=F_{1}, j=1,\cdots,F_{2} where F_{2} is the number of convolution feature maps, and

\begin{aligned} a\^{\left(2\right)} & = & ap\^{\left(1\right)}\in R\^{H_{2}\times W_{2}\times B_{2}}\end{aligned}

\begin{aligned} zc_{j}\^{\left(2\right)} & = & \sum_{i=1}\^{B_{2}}a_{i}\^{\left(2\right)}\star k_{ij}\^{\left(2\right)}+b_{j}\^{\left(2\right)}\in R\^{\left(H_{2}-h_{2}+1\right)\times\left(W_{2}-w_{2}+1\right)}\end{aligned}

\begin{aligned} ac_{j}\^{\left(2\right)} & = & f\left(zc_{j}\^{\left(2\right)}\right)\in R\^{\left(H_{2}-h_{2}+1\right)\times\left(W_{2}-w_{2}+1\right)}\end{aligned}

\begin{aligned} zp_{j}\^{\left(2\right)} & = & maxpool\left(ac_{j}\^{\left(2\right)}\right)\in R\^{\left(\frac{H_{2}-h_{2}+1}{poolsize_{2}}\right)\times\left(\frac{W_{2}-w_{2}+1}{poolsize_{2}}\right)}\end{aligned}

\begin{aligned} ap_{j}\^{\left(2\right)} & = & zp_{j}\^{\left(2\right)}\in R\^{\left(\frac{H_{2}-h_{2}+1}{poolsize_{2}}\right)\times\left(\frac{W_{2}-w_{2}+1}{poolsize_{2}}\right)}\end{aligned}

For I3-FC1-O, we have n_{3}=\left(\frac{H_{2}-h_{2}+1}{poolsize_{2}}\right)\times\left(\frac{W_{2}-w_{2}+1}{poolsize_{2}}\right)\times F_{2}

\begin{aligned} a\^{\left(3\right)} & = & reshape\left(ap\^{\left(2\right)}\right)\in R\^{n_{3}\times1}\end{aligned}

\begin{aligned} z\^{\left(4\right)} & = & W\^{\left(3\right)T}a\^{\left(3\right)}+b\^{\left(3\right)},\quad W\^{\left(3\right)}\in R\^{n_{3}\times n_{4}},b\^{\left(3\right)}\in R\^{n_{4}\times1}\end{aligned}

For Euclidean loss

\begin{aligned} a\^{\left(4\right)} & = & f\left(z\^{\left(4\right)}\right)\in R\^{n_{4}\times1}\end{aligned}

\begin{aligned} L & = & \frac{1}{2}\left|\left|y-a\^{\left(4\right)}\right|\right|_{2}\^{2}\end{aligned}

For cross-entropy loss

\begin{aligned} a_{i}\^{\left(4\right)} & = & \frac{\exp\left(z_{i}\^{\left(4\right)}\right)}{\sum_{k=1}\^{C}\exp\left(z_{k}\^{\left(4\right)}\right)}\end{aligned}

\begin{aligned} \frac{\partial a_{i}\^{\left(4\right)}}{\partial z_{i}\^{\left(4\right)}} & = & \frac{\exp\left(z_{i}\^{\left(4\right)}\right)\sum_{k=1}\^{C}\exp\left(z_{k}\^{\left(4\right)}\right)-\exp\left(z_{i}\^{\left(4\right)}\right)\exp\left(z_{i}\^{\left(4\right)}\right)}{\left(\sum_{k=1}\^{C}\exp\left(z_{k}\^{\left(4\right)}\right)\right)\^{2}}=a_{i}\^{\left(4\right)}\left(1-a_{i}\^{\left(4\right)}\right)\end{aligned}

\begin{aligned} \frac{\partial a_{j}\^{\left(4\right)}}{\partial z_{i}\^{\left(4\right)}} & = & \frac{-\exp\left(z_{j}\^{\left(4\right)}\right)\exp\left(z_{i}\^{\left(4\right)}\right)}{\left(\sum_{k=1}\^{C}\exp\left(z_{k}\^{\left(4\right)}\right)\right)\^{2}}=-a_{j}\^{\left(4\right)}a_{i}\^{\left(4\right)},\quad j\neq i\end{aligned}

\begin{aligned} L & = & -\sum_{i=1}\^{C}y_{i}\log\left(a_{i}\^{\left(4\right)}\right)=-\sum_{j\neq i}y_{j}\log\left(a_{j}\^{\left(4\right)}\right)-y_{i}\log\left(a_{i}\^{\left(4\right)}\right)\end{aligned}

\begin{aligned} \frac{\partial L}{\partial a_{i}\^{\left(4\right)}} & = & -\frac{y_{i}}{a_{i}\^{\left(4\right)}}\end{aligned}

\begin{aligned} \Rightarrow\delta\^{\left(a\^{\left(4\right)}\right)} & = & -\frac{y}{a\^{\left(4\right)}}\end{aligned}

\begin{aligned} \frac{\partial L}{\partial z_{i}\^{\left(4\right)}} & = & -\sum_{j\neq i}y_{j}\frac{1}{a_{j}\^{\left(4\right)}}\left(-a_{j}\^{\left(4\right)}a_{i}\^{\left(4\right)}\right)-y_{i}\frac{1}{a_{i}\^{\left(4\right)}}a_{i}\^{\left(4\right)}\left(1-a_{i}\^{\left(4\right)}\right)=\sum_{j\neq i}y_{j}a_{i}\^{\left(4\right)}-y_{i}+y_{i}a_{i}\^{\left(4\right)}=a_{i}\^{\left(4\right)}-y_{i}\end{aligned}

\begin{aligned} \Rightarrow & \delta\^{\left(z\^{\left(4\right)}\right)}= & \frac{\partial L}{\partial z\^{\left(4\right)}}=a\^{\left(4\right)}-y\in R\^{n_{4}\times1}\end{aligned}

\begin{aligned} \delta\^{\left(a\^{\left(3\right)}\right)} & = & \frac{\partial L}{\partial a\^{\left(3\right)}}=\frac{\partial L}{\partial z\^{\left(4\right)}}\frac{\partial z\^{\left(4\right)}}{\partial a\^{\left(3\right)}}=W\^{\left(3\right)}\delta\^{\left(z\^{\left(4\right)}\right)}\in R\^{n_{3}\times1}\end{aligned}

\begin{aligned} \nabla_{W\^{\left(3\right)}}L & = & \frac{\partial L}{\partial W\^{\left(3\right)}}=\frac{\partial L}{\partial z\^{\left(4\right)}}\frac{\partial z\^{\left(4\right)}}{\partial W\^{\left(3\right)}}=a\^{\left(3\right)}\delta\^{\left(z\^{\left(4\right)}\right)T}\in R\^{n_{3}\times n_{4}}\end{aligned}

\begin{aligned} \nabla_{b\^{\left(3\right)}}L & = & \frac{\partial L}{\partial b\^{\left(3\right)}}=\frac{\partial L}{\partial z\^{\left(4\right)}}\frac{\partial z\^{\left(4\right)}}{\partial b\^{\left(3\right)}}=\delta\^{\left(z\^{\left(4\right)}\right)}\in R\^{n_{4}\times1}\end{aligned}

Here, \delta\^{\left(a\^{\left(3\right)}\right)}\in R\^{n_{3}\times1} is the error sensitivity of the reshaped output of the second max-pooling layer. Reversing the reshape gives \delta\^{\left(ap_{j}\^{\left(2\right)}\right)}, and then

\begin{aligned} \delta\^{\left(zp_{j}\^{\left(2\right)}\right)} & = & \delta\^{\left(ap_{j}\^{\left(2\right)}\right)}\end{aligned}

then, we upsample the error sensitivity and get

\begin{aligned} \delta\^{\left(ac_{j}\^{\left(2\right)}\right)} & = & up\left(\delta\^{\left(zp_{j}\^{\left(2\right)}\right)}\right)\in R\^{\left(H_{2}-h_{2}+1\right)\times\left(W_{2}-w_{2}+1\right)}\end{aligned}

then, we continue to backpropagate the error sensitivity

\begin{aligned} \delta\^{\left(zc_{j}\^{\left(2\right)}\right)} & = & \delta\^{\left(ac_{j}\^{\left(2\right)}\right)}\circ ac_{j}\^{\left(2\right)}\circ\left(1-ac_{j}\^{\left(2\right)}\right)\in R\^{\left(H_{2}-h_{2}+1\right)\times\left(W_{2}-w_{2}+1\right)}\end{aligned}

then, we use the following operations to get the error sensitivity of a_{i}\^{\left(2\right)}

\begin{aligned} \delta\^{\left(a_{i}\^{\left(2\right)}\right)} & = & \frac{\partial L}{\partial a_{i}\^{\left(2\right)}}=\sum_{j=1}\^{F_{2}}\frac{\partial L}{\partial zc_{j}\^{\left(2\right)}}\frac{\partial zc_{j}\^{\left(2\right)}}{\partial a_{i}\^{\left(2\right)}}=\sum_{j=1}\^{F_{2}}conv2\left(\delta\^{\left(zc_{j}\^{\left(2\right)}\right)},\ rot180\left(k_{ij}\^{\left(2\right)}\right),\ 'full'\right)\in R\^{H_{2}\times W_{2}}\end{aligned}

\begin{aligned} \nabla_{k_{ij}\^{\left(2\right)}}L & = & \frac{\partial L}{\partial k_{ij}\^{\left(2\right)}}=\frac{\partial L}{\partial zc_{j}\^{\left(2\right)}}\frac{\partial zc_{j}\^{\left(2\right)}}{\partial k_{ij}\^{\left(2\right)}}=rot180\left(conv2\left(a_{i}\^{\left(2\right)},rot180\left(\delta\^{\left(zc_{j}\^{\left(2\right)}\right)}\right),'valid'\right)\right)\in R\^{h_{2}\times w_{2}}\end{aligned}

\begin{aligned} \nabla_{b_{j}\^{\left(2\right)}}L & = & \frac{\partial L}{\partial b_{j}\^{\left(2\right)}}=\frac{\partial L}{\partial zc_{j}\^{\left(2\right)}}\frac{\partial zc_{j}\^{\left(2\right)}}{\partial b_{j}\^{\left(2\right)}}=\sum_{u,v}\left(\delta\^{\left(zc_{j}\^{\left(2\right)}\right)}\right)_{u,v}\in R\^{1}\end{aligned}
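The 'full' convolution that propagates the sensitivity from the convolution outputs back to an input map can be verified the same way. The sketch below again assumes that the forward pass is a true 2-D 'valid' convolution, uses one input map and two output maps, and relies on hand-written helpers (`conv2_valid`, `conv2_full`) with arbitrary sizes and random values.

```python
import random

def rot180(m):
    return [row[::-1] for row in m[::-1]]

def conv2_valid(a, k):
    """True 2-D convolution (kernel flipped), 'valid' region only."""
    H, W, h, w = len(a), len(a[0]), len(k), len(k[0])
    kf = rot180(k)
    return [[sum(a[u + p][v + q] * kf[p][q] for p in range(h) for q in range(w))
             for v in range(W - w + 1)] for u in range(H - h + 1)]

def conv2_full(x, k):
    """True 2-D convolution, 'full' output of size (H + h - 1) x (W + w - 1)."""
    H, W, h, w = len(x), len(x[0]), len(k), len(k[0])
    out = [[0.0] * (W + w - 1) for _ in range(H + h - 1)]
    for i in range(H):
        for j in range(W):
            for p in range(h):
                for q in range(w):
                    out[i + p][j + q] += x[i][j] * k[p][q]
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(2)
H, W, h, w, F = 4, 5, 2, 3, 2
a = rand_matrix(rng, H, W)                                   # one input map a_i
ks = [rand_matrix(rng, h, w) for _ in range(F)]              # kernels k_ij
deltas = [rand_matrix(rng, H - h + 1, W - w + 1) for _ in range(F)]  # dL/dzc_j

def loss(inp):
    """Linear in every zc_j, so that dL/dzc_j equals deltas[j]."""
    total = 0.0
    for k, d in zip(ks, deltas):
        z = conv2_valid(inp, k)
        total += sum(dv * zv for drow, zrow in zip(d, z) for dv, zv in zip(drow, zrow))
    return total

# delta_a = sum_j conv2(delta_zc_j, rot180(k_ij), 'full')
delta_a = [[0.0] * W for _ in range(H)]
for k, d in zip(ks, deltas):
    contrib = conv2_full(d, rot180(k))
    for r in range(H):
        for c in range(W):
            delta_a[r][c] += contrib[r][c]

def numeric_grad_a(r, c, eps=1e-5):
    ap = [row[:] for row in a]
    am = [row[:] for row in a]
    ap[r][c] += eps
    am[r][c] -= eps
    return (loss(ap) - loss(am)) / (2.0 * eps)

max_err = max(abs(delta_a[r][c] - numeric_grad_a(r, c))
              for r in range(H) for c in range(W))
print(max_err)
```

The 'full' output has size (H - h + 1) + h - 1 = H in each dimension, which is why the result matches the shape of the input map.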

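Since a\^{\left(2\right)}=ap\^{\left(1\right)}, the sensitivity \delta\^{\left(a_{i}\^{\left(2\right)}\right)}\in R\^{H_{2}\times W_{2}} of each input map of C2 is the sensitivity of the corresponding pooled map of MP1, which gives \delta\^{\left(ap_{j}\^{\left(1\right)}\right)}\in R\^{\left(\frac{H_{1}-h_{1}+1}{poolsize_{1}}\right)\times\left(\frac{W_{1}-w_{1}+1}{poolsize_{1}}\right)}, then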

\begin{aligned} \delta\^{\left(zp_{j}\^{\left(1\right)}\right)} & = & \delta\^{\left(ap_{j}\^{\left(1\right)}\right)}\end{aligned}

then, we upsample the error sensitivity and have

\begin{aligned} \delta\^{\left(ac_{j}\^{\left(1\right)}\right)} & = & up\left(\delta\^{\left(zp_{j}\^{\left(1\right)}\right)}\right)\in R\^{\left(H_{1}-h_{1}+1\right)\times\left(W_{1}-w_{1}+1\right)}\end{aligned}

then, we continue to backpropagate the error sensitivity

\begin{aligned} \delta\^{\left(zc_{j}\^{\left(1\right)}\right)} & = & \delta\^{\left(ac_{j}\^{\left(1\right)}\right)}\circ ac_{j}\^{\left(1\right)}\circ\left(1-ac_{j}\^{\left(1\right)}\right)\in R\^{\left(H_{1}-h_{1}+1\right)\times\left(W_{1}-w_{1}+1\right)}\end{aligned}

then, we use the following operations to get the error sensitivity of a_{i}\^{\left(1\right)}

\begin{aligned} \delta\^{\left(a_{i}\^{\left(1\right)}\right)} & = & \frac{\partial L}{\partial a_{i}\^{\left(1\right)}}=\sum_{j=1}\^{F_{1}}\frac{\partial L}{\partial zc_{j}\^{\left(1\right)}}\frac{\partial zc_{j}\^{\left(1\right)}}{\partial a_{i}\^{\left(1\right)}}=\sum_{j=1}\^{F_{1}}conv2\left(\delta\^{\left(zc_{j}\^{\left(1\right)}\right)},\ rot180\left(k_{ij}\^{\left(1\right)}\right),\ 'full'\right)\in R\^{H_{1}\times W_{1}}\end{aligned}

\begin{aligned} \nabla_{k_{ij}\^{\left(1\right)}}L & = & \frac{\partial L}{\partial k_{ij}\^{\left(1\right)}}=\frac{\partial L}{\partial zc_{j}\^{\left(1\right)}}\frac{\partial zc_{j}\^{\left(1\right)}}{\partial k_{ij}\^{\left(1\right)}}=rot180\left(conv2\left(a_{i}\^{\left(1\right)},rot180\left(\delta\^{\left(zc_{j}\^{\left(1\right)}\right)}\right),'valid'\right)\right)\in R\^{h_{1}\times w_{1}}\end{aligned}

\begin{aligned} \nabla_{b_{j}\^{\left(1\right)}}L & = & \frac{\partial L}{\partial b_{j}\^{\left(1\right)}}=\frac{\partial L}{\partial zc_{j}\^{\left(1\right)}}\frac{\partial zc_{j}\^{\left(1\right)}}{\partial b_{j}\^{\left(1\right)}}=\sum_{u,v}\left(\delta\^{\left(zc_{j}\^{\left(1\right)}\right)}\right)_{u,v}\in R\^{1}\end{aligned}