
%!TEX root = main.tex
\section{Introduction}
Artificial neural networks have dozens of hyperparameters which influence
their behaviour during training and evaluation. One of those hyperparameters is
the choice of activation functions. While in principle every neuron could have
a different activation function, in practice networks typically use only two:
the softmax function for the output layer, in order to obtain a probability
distribution over the possible classes, and one activation function for all
other neurons.
Activation functions should have the following properties:
\begin{itemize}
\item \textbf{Non-linearity}: A feedforward network in which every activation
function is linear computes a linear function. This means that no matter how
many layers the network uses, there is an equivalent network with only the
input and the output layer (see the derivation after this list). Note that
\glspl{CNN} are different, as operations such as max pooling are non-linear.
\item \textbf{Differentiability}: Activation functions need to be
differentiable in order to apply gradient descent. They do not have to be
differentiable at every point: in practice, the gradient at non-differentiable
points can simply be set to zero to prevent weight updates at those points.
\item \textbf{Non-zero gradient}: The sign function is not suitable for
gradient descent based optimizers as its gradient is zero at all
differentiable points. An activation function should have infinitely
many points with non-zero gradient.
\end{itemize}
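The collapse of purely linear networks can be seen directly by composing two
layers (a standard argument, included here only for completeness; $W_i$ and
$b_i$ denote the weight matrices and bias vectors of the layers):
\[
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2),
\]
so any stack of layers with linear activation functions is equivalent to a
single layer with weight matrix $W_2 W_1$ and bias $W_2 b_1 + b_2$; the
argument extends to any number of layers by induction.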
One of the simplest and most widely used activation functions for \glspl{CNN}
is \gls{ReLU}~\cite{AlexNet-2012}, but others such as
\gls{ELU}~\cite{clevert2015fast}, \gls{PReLU}~\cite{he2015delving}, softplus~\cite{7280459}
and softsign~\cite{bergstra2009quadratic} have been proposed.
Activation functions differ in their range of values and their derivatives.
The definitions of eleven activation functions and further comparisons are
given in~\cref{table:activation-functions-overview}.
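For concreteness, a few of these functions can be written down directly. The
following NumPy sketch uses the common textbook definitions (with the usual
default $\alpha = 1$ for \gls{ELU}); the exact parametrizations used in the
experiments are the ones listed in~\cref{table:activation-functions-overview}.
\begin{verbatim}
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    # softplus(x) = ln(1 + exp(x))
    return np.log1p(np.exp(x))

def softsign(x):
    # softsign(x) = x / (1 + |x|)
    return x / (1.0 + np.abs(x))

def logistic(x):
    # logistic(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))
\end{verbatim}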
\section{Important Differences of Proposed Activation Functions}
There are several theoretical explanations why one activation function can be
preferable to another in some scenarios:
\begin{itemize}
\item \textbf{Vanishing Gradient}: Activation functions like tanh and the
logistic function saturate outside of the interval $[-5, 5]$, meaning their
gradient is close to zero there. As a consequence, weight updates for
preceding neurons are very small, which is especially a problem for very deep
or recurrent networks as described in~\cite{bengio1994learning}. Even if the
neurons learn eventually, learning is slower~\cite{AlexNet-2012}. A numeric
illustration is given after this list.
\item \textbf{Dying ReLU}: The dying \gls{ReLU} problem is similar to the
vanishing gradient problem. The gradient of the \gls{ReLU} function
is~0 for all non-positive values. Hence, if all elements of the training set
lead to a non-positive input for one neuron at any point in the training
process, this neuron stops receiving updates and no longer participates in
the training process. This problem is addressed in~\cite{maas2013rectifier}.
\item \textbf{Mean unit activation}: Some publications
like~\cite{clevert2015fast,BatchNormalization-2015} claim that mean unit
activations close to~0 are desirable, as they reduce the bias shift effect
and thereby speed up learning. The speedup of learning is supported by many
experiments. Hence the possibility of negative activations is desirable.
\end{itemize}
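To make the saturation argument concrete, consider the derivatives at the edge
of the interval mentioned above (a small numeric illustration, not taken from
the experiments; $\sigma$ denotes the logistic function):
\[
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr), \qquad
\sigma'(5) \approx 0.0066, \qquad
\tanh'(5) = 1 - \tanh^2(5) \approx 1.8 \cdot 10^{-4},
\]
whereas the derivative of \gls{ReLU} is exactly~1 for every positive input.
Gradients that pass through several saturated logistic or tanh units therefore
shrink quickly, while \gls{ReLU} units pass them on unchanged.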
Those considerations are listed
in~\cref{table:properties-of-activation-functions} for 11~activation functions.
Besides the theoretical properties, empirical results are provided
in~\cref{table:CIFAR-100-accuracies-activation-functions,table:CIFAR-100-timing-activation-functions}.
The baseline network was adjusted so that every activation function except the
one of the output layer was replaced by one of the 11~activation functions.
As expected, \gls{PReLU} and \gls{ELU} performed best. Unexpectedly, the
logistic function, tanh and softplus performed worse than the identity, and it
is unclear why the pure-softmax network performed so much better than the
logistic function.
One hypothesis why the logistic function performs so badly is that it cannot
produce negative outputs. Hence the logistic$^-$ function was developed:
\[\text{logistic}^{-}(x) = \frac{1}{1+ e^{-x}} - 0.5\]
The logistic$^-$ function has the same derivative as the logistic function and
hence still suffers from the vanishing gradient problem.
The network with the logistic$^-$ function achieves an accuracy which is
\SI{11.30}{\percent} better than the network with the logistic function, but is
still \SI{5.54}{\percent} worse than the \gls{ELU}.
Similarly, \gls{ReLU} was adjusted to allow negative outputs:
\[\text{ReLU}^{-}(x) = \max(-1, x) = \text{ReLU}(x+1) - 1\]
The results of \gls{ReLU}$^-$ are much worse on the training set, but similar
on the test set. This indicates that the possibility of hard zeros, and thus
of a sparse representation, is either not important or about as important as
the possibility to produce negative outputs. This
contradicts~\cite{glorot2011deep,srivastava2014understanding}.
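Both variants are one-line modifications of existing activation functions. The
following sketch shows how such functions could be defined with the Keras
backend; the names \texttt{logistic\_minus} and \texttt{relu\_minus} are
placeholders and not taken from the experiment code.
\begin{verbatim}
from keras import backend as K

def logistic_minus(x):
    # logistic(x) - 0.5: same derivative as the logistic function,
    # but shifted so that negative activations are possible
    return K.sigmoid(x) - 0.5

def relu_minus(x):
    # max(-1, x) = ReLU(x + 1) - 1: a shifted ReLU whose outputs
    # can go down to -1
    return K.relu(x + 1.0) - 1.0
\end{verbatim}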
A key difference between the logistic$^-$ function and \gls{ELU} is that
\gls{ELU} neither suffers from the vanishing gradient problem nor has a
bounded range of values. For this reason, the S2ReLU activation function was
developed:
\begin{align*}
\StwoReLU(x) &= \ReLU \left (\frac{x}{2} + 1 \right ) - \ReLU \left (-\frac{x}{2} + 1 \right)\\
&=
\begin{cases}\frac{x}{2} - 1 &\text{if } x \le -2\\
x &\text{if } -2 < x < 2\\
\frac{x}{2} + 1&\text{if } x \ge 2\end{cases}
\end{align*}
This function is similar to SReLUs as introduced in~\cite{jin2016deep}. The
difference is that S2ReLU does not introduce learnable parameters. S2ReLU was
designed to be symmetric, to be the identity close to zero and to have a
smaller absolute value than the identity farther away. It is easy to compute
and easy to implement, as the sketch below shows.
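Since S2ReLU is just the difference of two shifted \gls{ReLU} evaluations, one
possible implementation with the Keras backend is a one-liner (again a sketch;
\texttt{s2relu} is a placeholder name):
\begin{verbatim}
from keras import backend as K
from keras.layers import Activation

def s2relu(x):
    # S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1):
    # the identity on [-2, 2], slope 1/2 outside of it
    return K.relu(x / 2.0 + 1.0) - K.relu(-x / 2.0 + 1.0)

# usage: pass the function wherever Keras accepts an activation,
# e.g. model.add(Activation(s2relu))
\end{verbatim}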
Those results (not only the absolute values, but also the relative
comparison) might depend on the network architecture, the training
algorithm, the initialization and the dataset. Results for MNIST can be found
in~\cref{table:MNIST-accuracies-activation-functions} and for HASYv2
in~\cref{table:HASYv2-accuracies-activation-functions}. For both datasets, the
logistic function has a much shorter training time and a noticeably lower test
accuracy.
\glsunset{LReLU}
\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\multirow{2}{*}{Function} & Vanishing & Negative activation & Bounded \\
                          & gradient  & possible            & activation \\\midrule
Identity & \cellcolor{green!25}No & \cellcolor{green!25} Yes & \cellcolor{green!25}No \\
Logistic & \cellcolor{red!25} Yes & \cellcolor{red!25} No & \cellcolor{red!25} Yes \\
Logistic$^-$ & \cellcolor{red!25} Yes & \cellcolor{green!25} Yes & \cellcolor{red!25} Yes \\
Softmax & \cellcolor{red!25} Yes & \cellcolor{green!25} Yes & \cellcolor{red!25} Yes \\
tanh & \cellcolor{red!25} Yes & \cellcolor{green!25} Yes & \cellcolor{red!25} Yes \\
Softsign & \cellcolor{red!25} Yes & \cellcolor{green!25}Yes & \cellcolor{red!25} Yes \\
ReLU & \cellcolor{yellow!25}Yes\footnotemark & \cellcolor{red!25} No & \cellcolor{yellow!25}Half-sided \\
Softplus & \cellcolor{green!25}No & \cellcolor{red!25} No & \cellcolor{yellow!25}Half-sided \\
S2ReLU & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25} No \\
\gls{LReLU}/PReLU & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25} No \\
ELU & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25} No \\
\bottomrule
\end{tabular}
\caption[Activation function properties]{Properties of activation functions.}
\label{table:properties-of-activation-functions}
\end{table}
\footnotetext{The dying ReLU problem is similar to the vanishing gradient problem.}
\begin{table}[H]
\centering
\begin{tabular}{lccccc}
\toprule
\multirow{2}{*}{Function} & \multicolumn{2}{c}{Inference per} & Training & \multirow{2}{*}{Epochs} & Mean total \\\cline{2-3}
 & 1 Image & 128 Images & time & & training time \\\midrule
Identity & \SI{8}{\milli\second} & \SI{42}{\milli\second} & \SI{31}{\second\per\epoch} & 108 -- \textbf{148} &\SI{3629}{\second} \\
Logistic & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch} & \textbf{101} -- 167 &\textbf{\SI{2234}{\second}} \\
Logistic$^-$ & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \textbf{\SI{22}{\second\per\epoch}} & 133 -- 255 &\SI{3421}{\second} \\
Softmax & \SI{7}{\milli\second} & \SI{37}{\milli\second} & \SI{33}{\second\per\epoch} & 127 -- 248 &\SI{5250}{\second} \\
Tanh & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 125 -- 211 &\SI{3141}{\second} \\
Softsign & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 122 -- 205 &\SI{3505}{\second} \\
\gls{ReLU} & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 118 -- 192 &\SI{3449}{\second} \\
Softplus & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch} & \textbf{101} -- 165 &\SI{2718}{\second} \\
S2ReLU & \textbf{\SI{5}{\milli\second}} & \SI{32}{\milli\second} & \SI{26}{\second\per\epoch} & 108 -- 209 &\SI{3231}{\second} \\
\gls{LReLU} & \SI{7}{\milli\second} & \SI{34}{\milli\second} & \SI{25}{\second\per\epoch} & 109 -- 198 &\SI{3388}{\second} \\
\gls{PReLU} & \SI{7}{\milli\second} & \SI{34}{\milli\second} & \SI{28}{\second\per\epoch} & 131 -- 215 &\SI{3970}{\second} \\
\gls{ELU} & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 146 -- 232 &\SI{3692}{\second} \\
\bottomrule
\end{tabular}
\caption[Activation function timing results on CIFAR-100]{Training time and
inference time of adjusted baseline models trained with different
activation functions on GTX~970 \glspl{GPU} on CIFAR-100. The identity was
expected to be the fastest function; that it is not is likely an
implementation-specific issue of Keras~2.0.4 or TensorFlow~1.1.0.}
\label{table:CIFAR-100-timing-activation-functions}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{lccccc}
\toprule
\multirow{2}{*}{Function} & \multicolumn{2}{c}{Single model} & Ensemble & \multicolumn{2}{c}{Epochs}\\\cline{2-3}\cline{5-6}
 & Accuracy & std & Accuracy & Range & Mean \\\midrule
Identity & \SI{99.45}{\percent} & $\sigma=0.09$ & \SI{99.63}{\percent} & 55 -- \hphantom{0}77 & 62.2\\
Logistic & \SI{97.27}{\percent} & $\sigma=2.10$ & \SI{99.48}{\percent} & \textbf{37} -- \hphantom{0}76 & \textbf{54.5}\\
Softmax & \SI{99.60}{\percent} & $\boldsymbol{\sigma=0.03}$& \SI{99.63}{\percent} & 44 -- \hphantom{0}73 & 55.6\\
Tanh & \SI{99.40}{\percent} & $\sigma=0.09$ & \SI{99.57}{\percent} & 56 -- \hphantom{0}80 & 67.6\\
Softsign & \SI{99.40}{\percent} & $\sigma=0.08$ & \SI{99.57}{\percent} & 72 -- 101 & 84.0\\
\gls{ReLU} & \textbf{\SI{99.62}{\percent}} & $\boldsymbol{\sigma=0.04}$ & \textbf{\SI{99.73}{\percent}} & 51 -- \hphantom{0}94 & 71.7\\
Softplus & \SI{99.52}{\percent} & $\sigma=0.05$ & \SI{99.62}{\percent} & 62 -- \hphantom{0}\textbf{70} & 68.9\\
\gls{PReLU} & \SI{99.57}{\percent} & $\sigma=0.07$ & \textbf{\SI{99.73}{\percent}} & 44 -- \hphantom{0}89 & 71.2\\
\gls{ELU} & \SI{99.53}{\percent} & $\sigma=0.06$ & \SI{99.58}{\percent} & 45 -- 111 & 72.5\\
\bottomrule
\end{tabular}
\caption[Activation function evaluation results on MNIST]{Test accuracy of
adjusted baseline models trained with different activation
functions on MNIST.}
\label{table:MNIST-accuracies-activation-functions}
\end{table}
\glsreset{LReLU}