%!TEX root = main.tex
\section{Introduction}
TODO\cite{Thoma:2014}
\section{Terminology}
TODO
\section{Activation Functions}
Nonlinear, differentiable activation functions are important for neural
networks because they allow them to learn nonlinear decision boundaries. One of
the simplest and most widely used activation functions for \glspl{CNN} is
\gls{ReLU}~\cite{AlexNet-2012}, but others such as
\gls{ELU}~\cite{clevert2015fast}, \gls{PReLU}~\cite{he2015delving}, softplus~\cite{7280459}
and softsign~\cite{bergstra2009quadratic} have been proposed. The baseline uses
\gls{ELU}.
Activation functions differ in their range of values and their derivatives. The
definitions of eleven activation functions, along with further comparisons, are
given in~\cref{table:activation-functions-overview}.
Theoretical explanations for why one activation function is preferable to
another in some scenarios are the following:
\begin{itemize}
\item \textbf{Vanishing gradient}: Activation functions like tanh and the
logistic function saturate outside of the interval $[-5, 5]$. This
means weight updates for preceding neurons are very small, which is
a particular problem for very deep or recurrent networks, as described
in~\cite{bengio1994learning}. Even if such neurons eventually learn,
they learn more slowly~\cite{AlexNet-2012}.
\item \textbf{Dying ReLU}: The dying \gls{ReLU} problem is similar to the
vanishing gradient problem. The gradient of the \gls{ReLU} function
is~0 for all non-positive values. Hence, if every element of the
training set leads to a negative input for one neuron at any point in
the training process, this neuron no longer receives updates and thus
stops participating in the training process. This problem is
addressed in~\cite{maas2013rectifier}.
\item \textbf{Mean unit activation}: Some publications
like~\cite{clevert2015fast,BatchNormalization-2015} claim that mean
unit activations close to 0 are desirable, arguing that they
speed up learning by reducing the bias shift effect. The speedup
of learning is supported by many experiments. Hence the possibility
of negative activations is desirable.
\end{itemize}
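The saturation behind the vanishing gradient argument can be made concrete with a short Python sketch (an illustration, not part of the experiments reported here): the derivative of the logistic function peaks at $0.25$ and is practically zero outside $[-5, 5]$, so gradients shrink as they are propagated through saturated layers.

```python
import math

def logistic(x):
    """Logistic (sigmoid) activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad(x):
    """Derivative of the logistic function: sigma(x) * (1 - sigma(x))."""
    s = logistic(x)
    return s * (1.0 - s)

# The gradient is at most 0.25 (attained at x = 0) and practically
# vanishes outside [-5, 5], so preceding layers get tiny updates.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, logistic_grad(x))
```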
These considerations are listed
in~\cref{table:properties-of-activation-functions} for 11~activation functions.
Besides the theoretical properties, empirical results are provided
in~\cref{table:CIFAR-100-accuracies-activation-functions,table:CIFAR-100-timing-activation-functions}.
The baseline network was adjusted so that every activation function except the
one of the output layer was replaced by one of the 11~activation functions.
As expected, \gls{PReLU} and \gls{ELU} performed best. Unexpectedly, the
logistic function, tanh and softplus performed worse than the identity, and it
is unclear why the pure-softmax network performed so much better than the
logistic function.
One hypothesis why the logistic function performs so badly is that it cannot
produce negative outputs. Hence the logistic$^-$ function was developed:
\[\text{logistic}^{-}(x) = \frac{1}{1+ e^{-x}} - 0.5\]
The logistic$^-$ function has the same derivative as the logistic function and
hence still suffers from the vanishing gradient problem.
The network with the logistic$^-$ function achieves an accuracy which is
\SI{11.30}{\percent} better than the network with the logistic function, but
still \SI{5.54}{\percent} worse than the one with \gls{ELU}.
Similarly, \gls{ReLU} was adjusted to allow negative outputs:
\[\text{ReLU}^{-}(x) = \max(-1, x) = \text{ReLU}(x+1) - 1\]
The results of \gls{ReLU}$^-$ are much worse on the training set, but similar
on the test set. This result indicates that the possibility of a hard zero,
and thus a sparse representation, is either not important or similarly
important as the possibility to produce negative outputs. This
contradicts~\cite{glorot2011deep,srivastava2014understanding}.
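The equivalence $\max(-1, x) = \text{ReLU}(x+1) - 1$ used above can be checked directly; the following Python sketch uses hypothetical function names and is purely illustrative:

```python
def relu(x):
    """Standard rectifier: max(0, x)."""
    return max(0.0, x)

def relu_minus(x):
    """ReLU^-(x) = max(-1, x): a rectifier shifted so that outputs
    down to -1 are possible, at the cost of exact zeros."""
    return max(-1.0, x)

# Both formulations of ReLU^- agree everywhere.
for x in (-5.0, -1.0, -0.5, 0.0, 0.5, 3.0):
    assert relu_minus(x) == relu(x + 1.0) - 1.0
```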
A key difference between the logistic$^-$ function and \gls{ELU} is that
\gls{ELU} neither suffers from the vanishing gradient problem nor has a
bounded range of values. For this reason, the S2ReLU activation function,
defined as
\begin{align*}
\StwoReLU(x) &= \ReLU \left (\frac{x}{2} + 1 \right ) - \ReLU \left (-\frac{x}{2} + 1 \right)\\
&=
\begin{cases}\frac{x}{2} - 1 &\text{if } x \le -2\\
x &\text{if } -2\le x \le 2\\
\frac{x}{2} + 1&\text{if } x \ge 2\end{cases}
\end{align*}
was developed.
This function is similar to SReLUs as introduced in~\cite{jin2016deep}. The
difference is that S2ReLU does not introduce learnable parameters. The S2ReLU
was designed to be symmetric, to be the identity close to zero, and to have a
smaller absolute value than the identity farther away. It is easy to compute
and easy to implement.
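To illustrate, here is a minimal Python sketch of S2ReLU (hypothetical names, not the project's code), together with a check that the closed form and the piecewise form coincide:

```python
def relu(x):
    """Standard rectifier: max(0, x)."""
    return max(0.0, x)

def s2relu(x):
    """Closed form: S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1)."""
    return relu(x / 2.0 + 1.0) - relu(-x / 2.0 + 1.0)

def s2relu_piecewise(x):
    """Piecewise form: identity on [-2, 2], slope 1/2 outside."""
    if x <= -2.0:
        return x / 2.0 - 1.0
    if x <= 2.0:
        return x
    return x / 2.0 + 1.0

# The two definitions agree on a grid of test points, and the
# function is point-symmetric: s2relu(-x) == -s2relu(x).
for i in range(-80, 81):
    x = i / 10.0
    assert abs(s2relu(x) - s2relu_piecewise(x)) < 1e-12
    assert abs(s2relu(-x) + s2relu(x)) < 1e-12
```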
Those results --- not only the absolute values, but also the relative
comparison --- might depend on the network architecture, the training
algorithm, the initialization and the dataset. Results for MNIST can be found
in~\cref{table:MNIST-accuracies-activation-functions} and for HASYv2
in~\cref{table:HASYv2-accuracies-activation-functions}. For both datasets, the
logistic function has a much shorter training time and a noticeably lower test
accuracy.
\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\multirow{2}{*}{Function} & Vanishing & Negative activation & Bounded \\
& gradient & possible & activation \\\midrule
Identity & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
Logistic & \cellcolor{red!25}Yes & \cellcolor{red!25}No & \cellcolor{red!25}Yes \\
Logistic$^-$ & \cellcolor{red!25}Yes & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
Softmax & \cellcolor{red!25}Yes & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
tanh & \cellcolor{red!25}Yes & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
Softsign & \cellcolor{red!25}Yes & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
ReLU & \cellcolor{yellow!25}Yes\footnotemark & \cellcolor{red!25}No & \cellcolor{yellow!25}Half-sided \\
Softplus & \cellcolor{green!25}No & \cellcolor{red!25}No & \cellcolor{yellow!25}Half-sided \\
S2ReLU & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
LReLU/PReLU & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
ELU & \cellcolor{green!25}No & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
\bottomrule
\end{tabular}
\caption[Activation function properties]{Properties of activation functions.}
\label{table:properties-of-activation-functions}
\end{table}
\footnotetext{The dying ReLU problem is similar to the vanishing gradient problem.}
\glsunset{LReLU}
\begin{table}[H]
\centering
\begin{tabular}{lccccc}
\toprule
\multirow{2}{*}{Function} & \multicolumn{2}{c}{Inference per} & Training & \multirow{2}{*}{Epochs} & Mean total \\\cline{2-3}
& 1 image & 128 images & time & & training time \\\midrule
Identity & \SI{8}{\milli\second} & \SI{42}{\milli\second} & \SI{31}{\second\per\epoch} & 108 -- \textbf{148} & \SI{3629}{\second} \\
Logistic & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch} & \textbf{101} -- 167 & \textbf{\SI{2234}{\second}} \\
Logistic$^-$ & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \textbf{\SI{22}{\second\per\epoch}} & 133 -- 255 & \SI{3421}{\second} \\
Softmax & \SI{7}{\milli\second} & \SI{37}{\milli\second} & \SI{33}{\second\per\epoch} & 127 -- 248 & \SI{5250}{\second} \\
Tanh & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 125 -- 211 & \SI{3141}{\second} \\
Softsign & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 122 -- 205 & \SI{3505}{\second} \\
\gls{ReLU} & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 118 -- 192 & \SI{3449}{\second} \\
Softplus & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch} & \textbf{101} -- 165 & \SI{2718}{\second} \\
S2ReLU & \textbf{\SI{5}{\milli\second}} & \SI{32}{\milli\second} & \SI{26}{\second\per\epoch} & 108 -- 209 & \SI{3231}{\second} \\
\gls{LReLU} & \SI{7}{\milli\second} & \SI{34}{\milli\second} & \SI{25}{\second\per\epoch} & 109 -- 198 & \SI{3388}{\second} \\
\gls{PReLU} & \SI{7}{\milli\second} & \SI{34}{\milli\second} & \SI{28}{\second\per\epoch} & 131 -- 215 & \SI{3970}{\second} \\
\gls{ELU} & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch} & 146 -- 232 & \SI{3692}{\second} \\
\bottomrule
\end{tabular}
\caption[Activation function timing results on CIFAR-100]{Training time and
inference time of adjusted baseline models trained with different
activation functions on GTX~970 \glspl{GPU} on CIFAR-100. It was
expected that the identity would be the fastest function; the result is
likely an implementation-specific problem of Keras~2.0.4 or
TensorFlow~1.1.0.}
\label{table:CIFAR-100-timing-activation-functions}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{lccccc}
\toprule
\multirow{2}{*}{Function} & \multicolumn{2}{c}{Single model} & Ensemble & \multicolumn{2}{c}{Epochs}\\\cline{2-3}\cline{5-6}
& Accuracy & std & Accuracy & Range & Mean \\\midrule
Identity & \SI{99.45}{\percent} & $\sigma=0.09$ & \SI{99.63}{\percent} & 55 -- \hphantom{0}77 & 62.2\\%TODO: Really?
Logistic & \SI{97.27}{\percent} & $\sigma=2.10$ & \SI{99.48}{\percent} & \textbf{37} -- \hphantom{0}76 & \textbf{54.5}\\
Softmax & \SI{99.60}{\percent} & $\boldsymbol{\sigma=0.03}$ & \SI{99.63}{\percent} & 44 -- \hphantom{0}73 & 55.6\\
Tanh & \SI{99.40}{\percent} & $\sigma=0.09$ & \SI{99.57}{\percent} & 56 -- \hphantom{0}80 & 67.6\\
Softsign & \SI{99.40}{\percent} & $\sigma=0.08$ & \SI{99.57}{\percent} & 72 -- 101 & 84.0\\
\gls{ReLU} & \textbf{\SI{99.62}{\percent}} & $\boldsymbol{\sigma=0.04}$ & \textbf{\SI{99.73}{\percent}} & 51 -- \hphantom{0}94 & 71.7\\
Softplus & \SI{99.52}{\percent} & $\sigma=0.05$ & \SI{99.62}{\percent} & 62 -- \hphantom{0}\textbf{70} & 68.9\\
\gls{PReLU} & \SI{99.57}{\percent} & $\sigma=0.07$ & \textbf{\SI{99.73}{\percent}} & 44 -- \hphantom{0}89 & 71.2\\
\gls{ELU} & \SI{99.53}{\percent} & $\sigma=0.06$ & \SI{99.58}{\percent} & 45 -- 111 & 72.5\\
\bottomrule
\end{tabular}
\caption[Activation function evaluation results on MNIST]{Test accuracy of
adjusted baseline models trained with different activation
functions on MNIST.}
\label{table:MNIST-accuracies-activation-functions}
\end{table}
\glsreset{LReLU}