%!TEX root = write-math-ba-paper.tex
\section{Optimization of System Design}\label{ch:Optimization-of-System-Design}
In order to evaluate the effect of different preprocessing algorithms, features
and adjustments in the \gls{MLP} training and topology, the following baseline
system was used:
scale the recording to fit into a unit square while keeping the aspect ratio,
shift it as described in \cref{sec:preprocessing},
and resample it with linear interpolation to get 20~points per stroke, spaced
evenly in time. Take the first 4~strokes with 20~points per stroke and
2~coordinates per point as features, resulting in 160~features, which equals
the number of input neurons. If a recording has fewer than 4~strokes, the
remaining features are filled with zeros.
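
The following is a minimal sketch of this feature extraction in Python with
NumPy, assuming a recording is a list of strokes and each stroke is a list of
points with \texttt{x}, \texttt{y} and \texttt{time} entries; the helper names
are illustrative, not the actual implementation.
\begin{verbatim}
import numpy as np

def resample(stroke, n=20):
    """Resample a stroke to n points, spaced evenly in time,
    using linear interpolation (illustrative helper)."""
    t = np.array([p["time"] for p in stroke], dtype=float)
    x = np.array([p["x"] for p in stroke], dtype=float)
    y = np.array([p["y"] for p in stroke], dtype=float)
    t_new = np.linspace(t[0], t[-1], n)
    return np.column_stack([np.interp(t_new, t, x),
                            np.interp(t_new, t, y)])

def baseline_features(recording, max_strokes=4, points=20):
    """Build the 160-dimensional baseline feature vector:
    4 strokes x 20 points x 2 coordinates, zero-padded."""
    features = np.zeros(max_strokes * points * 2)
    for i, stroke in enumerate(recording[:max_strokes]):
        coords = resample(stroke, points)          # shape (20, 2)
        features[i * points * 2:(i + 1) * points * 2] = coords.ravel()
    return features
\end{verbatim}
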
All experiments were evaluated with four baseline systems $B_{hl=i}$, $i \in \Set{1,
2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
could have a severe influence on the effect of new features or preprocessing
steps. Each hidden layer in all evaluated systems has $500$ neurons.
Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
the sigmoid function. The neurons in the
output layer use the softmax function. For every experiment, exactly one part
of the baseline systems was changed.
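
The forward pass of such a baseline network can be sketched as follows
(sigmoid hidden layers, softmax output); this is an illustration of the
topology, not the training code used for the experiments.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """Forward pass of a baseline MLP: 160 inputs, hidden layers with
    500 sigmoid neurons each, and a softmax output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)
    W_out, b_out = weights[-1], biases[-1]
    return softmax(W_out @ a + b_out)
\end{verbatim}
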
\subsection{Random Weight Initialization}
The neural networks in all experiments were initialized with small random
weights
\[w_{i,j} \sim U\left(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}},\; 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}\right)\]
where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
initialization was suggested in
\cite{deeplearningweights} and is done to break symmetry.
It can lead to different error rates for the same systems just because the
initialization was different.
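
A direct transcription of this initialization into NumPy could look as follows
(the function name is illustrative):
\begin{verbatim}
import numpy as np

def init_weights(n_l, n_next, rng=None):
    """Draw the weights between layer l (n_l neurons) and layer l+1
    (n_next neurons) uniformly from
    [-4*sqrt(6/(n_l + n_next)), +4*sqrt(6/(n_l + n_next))]."""
    if rng is None:
        rng = np.random.default_rng()
    bound = 4.0 * np.sqrt(6.0 / (n_l + n_next))
    return rng.uniform(-bound, bound, size=(n_next, n_l))
\end{verbatim}
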
In order to get an impression of the magnitude of this influence on the different
topologies and error rates, the baseline models were trained 5~times with
random initializations.
\Cref{table:baseline-systems-random-initializations-summary}
shows a summary of the results. The more hidden layers are used, the more
the results vary between different random weight initializations.
\begin{table}[h]
\centering
\begin{tabular}{crrr|rrr} %chktex 44
\toprule
\multirow{3}{*}{System} & \multicolumn{6}{c}{Classification error}\\
\cmidrule(l){2-7}
& \multicolumn{3}{c}{Top-1} & \multicolumn{3}{c}{Top-3}\\
& Min & Max & Mean & Min & Max & Mean\\\midrule
$B_{hl=1}$ & $\SI{23.1}{\percent}$ & $\SI{23.4}{\percent}$ & $\SI{23.2}{\percent}$ & $\SI{6.7}{\percent}$ & $\SI{6.8}{\percent}$ & $\SI{6.7}{\percent}$ \\
$B_{hl=2}$ & \underline{$\SI{21.4}{\percent}$} & \underline{$\SI{21.8}{\percent}$}& \underline{$\SI{21.6}{\percent}$} & $\SI{5.7}{\percent}$ & \underline{$\SI{5.8}{\percent}$} & \underline{$\SI{5.7}{\percent}$}\\
$B_{hl=3}$ & $\SI{21.5}{\percent}$ & $\SI{22.3}{\percent}$ & $\SI{21.9}{\percent}$ & \underline{$\SI{5.5}{\percent}$} & $\SI{5.8}{\percent}$ & \underline{$\SI{5.7}{\percent}$}\\
$B_{hl=4}$ & $\SI{23.2}{\percent}$ & $\SI{24.8}{\percent}$ & $\SI{23.9}{\percent}$ & $\SI{6.0}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{6.2}{\percent}$\\
\bottomrule
\end{tabular}
\caption{The systems $B_{hl=1}$ -- $B_{hl=4}$ were randomly initialized,
trained and evaluated 5~times to estimate the influence of random
weight initialization.}
\label{table:baseline-systems-random-initializations-summary}
\end{table}
\subsection{Stroke Connection}
In order to solve the problem of interrupted strokes, pairs of strokes
can be connected with a stroke connection algorithm. The idea is that if a
stroke was accidentally split into two strokes, the last point of $s_i$ is
close to the first point of $s_{i+1}$ for a pair of consecutively drawn
strokes $s_{i}, s_{i+1}$.
$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connection
algorithm was evaluated with thresholds below this range: $\theta = \SI{5}{\pixel}$,
$\SI{10}{\pixel}$ and $\SI{20}{\pixel}$.
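
A minimal sketch of such a stroke connection step is given below; the distance
threshold \texttt{theta} corresponds to $\theta$ in pixels of the raw
recording, and the function name is illustrative.
\begin{verbatim}
import math

def connect_strokes(strokes, theta):
    """Merge consecutively drawn strokes whose gap is below theta.

    `strokes` is a list of strokes; each stroke is a list of (x, y)
    points in the raw pixel coordinates of the recording."""
    if not strokes:
        return []
    connected = [list(strokes[0])]
    for stroke in strokes[1:]:
        last_x, last_y = connected[-1][-1]
        first_x, first_y = stroke[0]
        gap = math.hypot(first_x - last_x, first_y - last_y)
        if gap < theta:
            connected[-1].extend(stroke)  # treat as one interrupted stroke
        else:
            connected.append(list(stroke))
    return connected
\end{verbatim}
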
With a threshold of $\theta = \SI{10}{\pixel}$, the top-3 error of all models
improved by at least $\num{0.2}$~percentage points, except for $B_{hl=4}$,
which did not improve notably.
\subsection{Douglas-Peucker Smoothing}
The Douglas-Peucker algorithm was applied with thresholds of $\varepsilon =
0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
but before resampling. The interpolation in the resampling step was done
linearly in one set of experiments and with cubic splines in another. The
recording was scaled and shifted again after the interpolation because the
bounding box might have changed.
Applying Douglas-Peucker smoothing with $\varepsilon > 0.05$ resulted in a
sharp rise of the top-1 and top-3 error for all models $B_{hl=i}$.
This means that the simplification process removes relevant information and
not, as was expected, only noise. For $\varepsilon = 0.05$
with linear interpolation the top-1 error of some models improved, but the
changes were small and could be an effect of random weight initialization.
Cubic spline interpolation, however, made all systems perform more than
$\num{1.7}$~percentage points worse in both top-1 and top-3 error.
The lower the value of $\varepsilon$, the less the recording changes in
this preprocessing step. As the algorithm was applied after scaling the
recording such that the biggest dimension of the recording (width or height)
is $1$, a value of $\varepsilon = 0.05$ means that a point has to deviate by
at least $\SI{5}{\percent}$ of the biggest dimension from the simplifying line
segment in order to be kept.
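
For reference, a compact sketch of the Douglas-Peucker simplification applied
to a single stroke (points are $(x, y)$ tuples in the unit square;
\texttt{epsilon} corresponds to $\varepsilon$):
\begin{verbatim}
import math

def point_line_distance(p, a, b):
    """Perpendicular distance of point p from the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    if (ax, ay) == (bx, by):
        return math.hypot(px - ax, py - ay)
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    return num / math.hypot(bx - ax, by - ay)

def douglas_peucker(points, epsilon):
    """Recursively remove points that deviate less than epsilon
    from the line between the first and the last point."""
    if len(points) < 3:
        return list(points)
    # find the point with the largest distance to the end-point line
    d_max, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > d_max:
            d_max, index = d, i
    if d_max < epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right
\end{verbatim}
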
\subsection{Global Features}
Single global features were added one at a time to the baseline systems. Those
features were the re-curvature
$\text{re-curvature}(\text{stroke}) = \frac{\text{height}(\text{stroke})}{\text{length}(\text{stroke})}$
as described in \cite{Huang06}, the ink feature, which is the summed length
of all strokes, the stroke count, the aspect ratio and the stroke center points
for the first four strokes. The stroke center point feature improved the
top-3 error of system $B_{hl=1}$ by $\num{0.3}$~percentage points and the
top-1 error of system $B_{hl=3}$ by $\num{0.7}$~percentage points, but all
other systems and error measures either got worse or did not improve much.
The other global features did improve the systems $B_{hl=1}$ -- $B_{hl=3}$, but not
$B_{hl=4}$. The highest improvement was achieved with the re-curvature feature. It
improved the systems $B_{hl=1}$ -- $B_{hl=4}$ by more than $\num{0.6}$~percentage points
in the top-1 error.
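
The global features mentioned above could be computed roughly as follows; the
sketch assumes a stroke is a list of $(x, y)$ points after scaling and
shifting, the aspect ratio is taken as width divided by height, and all helper
names are illustrative.
\begin{verbatim}
import math

def stroke_length(stroke):
    """Summed length of the line segments of one stroke."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(stroke, stroke[1:]))

def re_curvature(stroke):
    """height(stroke) / length(stroke), as described in Huang et al."""
    ys = [y for _, y in stroke]
    height = max(ys) - min(ys)
    length = stroke_length(stroke)
    return height / length if length > 0 else 0.0

def global_features(recording):
    xs = [x for stroke in recording for x, _ in stroke]
    ys = [y for stroke in recording for _, y in stroke]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return {
        "ink": sum(stroke_length(s) for s in recording),
        "stroke_count": len(recording),
        "aspect_ratio": width / height if height > 0 else 0.0,
        "center_points": [(sum(x for x, _ in s) / len(s),
                           sum(y for _, y in s) / len(s))
                          for s in recording[:4]],
    }
\end{verbatim}
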
\subsection{Data Multiplication}
Data multiplication can be used to make the model invariant to transformations.
However, this idea does not seem to work well in the domain of on-line handwritten
mathematical symbols. We tripled the data by adding a version that is rotated
3~degrees to the left and another one that is rotated 3~degrees to the right
around the center of mass. This data multiplication made all classifiers
perform worse on most error measures, with the top-1 error rising by more than
$\num{2}$~percentage points.
The same experiment was repeated with rotations of 6~degrees and, in a further
experiment, of 9~degrees, but those performed even worse.
Multiplying the data by a factor of 5 by adding two 3-degree rotated
variants and two 6-degree rotated variants also made the classifiers perform
worse by more than $\num{2}$~percentage points.
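
A sketch of this augmentation step, rotating a recording around the center of
mass of all its points (angles in degrees; names illustrative):
\begin{verbatim}
import math

def rotate_recording(recording, angle_deg):
    """Rotate every point of a recording by angle_deg around the
    center of mass of all points."""
    points = [p for stroke in recording for p in stroke]
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [[(cx + (x - cx) * cos_a - (y - cy) * sin_a,
              cy + (x - cx) * sin_a + (y - cy) * cos_a)
             for x, y in stroke]
            for stroke in recording]

def triple_data(recordings):
    """Triple the training data with +3 and -3 degree rotated copies."""
    augmented = []
    for rec in recordings:
        augmented += [rec, rotate_recording(rec, 3), rotate_recording(rec, -3)]
    return augmented
\end{verbatim}
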
\subsection{Pretraining}\label{subsec:pretraining-evaluation}
Pretraining is a technique used to improve the training of \glspl{MLP} with
multiple hidden layers.
\Cref{table:pretraining-slp} shows that \gls{SLP} improves the classification
performance by $\num{1.6}$~percentage points for the top-1 error and
$\num{1.0}$~percentage points for the top-3 error. As one can see in
\cref{fig:training-and-test-error-for-different-topologies-pretraining}, this
is not only an effect of the longer training, as the test error is
relatively stable after $\num{1000}$ epochs of training. This was confirmed
by an experiment in which the baseline systems were trained for $\num{10000}$
epochs and did not perform notably differently.
\begin{figure}[htb]
\centering
\input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
\caption{Training and test error by number of trained epochs for different
topologies with \acrfull{SLP}. The plot shows
that all pretrained systems performed much better than the systems
without pretraining. None of the plotted systems improved
with more epochs of training.}
\label{fig:training-and-test-error-for-different-topologies-pretraining}
\end{figure}
\begin{table}[tb]
\centering
\begin{tabular}{lrrrr}
\toprule
\multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
\cmidrule(l){2-5}
& Top-1 & Change & Top-3 & Change \\\midrule
$B_{hl=1}$ & $\SI{23.2}{\percent}$ & - & $\SI{6.7}{\percent}$ & - \\
$B_{hl=2,SLP}$ & $\SI{19.9}{\percent}$ & $\SI{-1.7}{\percent}$ & $\SI{4.7}{\percent}$ & $\SI{-1.0}{\percent}$\\
$B_{hl=3,SLP}$ & \underline{$\SI{19.4}{\percent}$} & $\SI{-2.5}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.1}{\percent}$\\
$B_{hl=4,SLP}$ & $\SI{19.6}{\percent}$ & $\SI{-4.3}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.6}{\percent}$\\
\bottomrule
\end{tabular}
\caption{Systems with 1--4 hidden layers which used \acrfull{SLP}
compared to the mean of the systems $B_{hl=1}$--$B_{hl=4}$ displayed
in \cref{table:baseline-systems-random-initializations-summary},
which used pure gradient descent. The \gls{SLP}
systems clearly performed better.}
\label{table:pretraining-slp}
\end{table}
Pretraining with a denoising auto-encoder led to much worse results, which are
listed in \cref{table:pretraining-denoising-auto-encoder}. The first layer used
a $\tanh$ activation function. Every layer was trained for $1000$ epochs with
the \gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
$\varkappa = 0.3$ and an $L_2$ regularization of $\lambda = 10^{-4}$ were
chosen. This pretraining setup made all systems perform much worse on all
error measures.
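
The pretraining of a single layer as a denoising auto-encoder with the
parameters above can be sketched as follows; tied weights, a linear decoder and
zeroing corruption are assumptions of this illustration, not necessarily the
setup used in the experiments.
\begin{verbatim}
import numpy as np

def pretrain_denoising_layer(X, n_hidden, epochs=1000, eta=0.001,
                             corruption=0.3, l2=1e-4, rng=None):
    """Train one denoising auto-encoder layer on data X (n_samples x n_in):
    corrupt the input, encode with tanh, decode with tied weights, and
    minimize the MSE reconstruction loss with L2 regularization."""
    if rng is None:
        rng = np.random.default_rng()
    n_in = X.shape[1]
    bound = 4.0 * np.sqrt(6.0 / (n_in + n_hidden))
    W = rng.uniform(-bound, bound, size=(n_in, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_in)
    for _ in range(epochs):
        # corrupt the input: randomly set a fraction of the entries to zero
        mask = rng.random(X.shape) > corruption
        X_noisy = X * mask
        # forward pass: tanh encoder, linear decoder with tied weights
        H = np.tanh(X_noisy @ W + b_h)            # (n_samples, n_hidden)
        X_rec = H @ W.T + b_v                     # reconstruction
        # gradients of the MSE loss (plus L2 weight decay)
        diff = X_rec - X                          # (n_samples, n_in)
        grad_H = (diff @ W) * (1.0 - H ** 2)      # backprop through tanh
        grad_W = (X_noisy.T @ grad_H + diff.T @ H) / len(X) + l2 * W
        W -= eta * grad_W
        b_h -= eta * grad_H.mean(axis=0)
        b_v -= eta * diff.mean(axis=0)
    return W, b_h                  # used to initialize one hidden layer
\end{verbatim}
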
\begin{table}[tb]
\centering
\begin{tabular}{lrrrr}
\toprule
\multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
\cmidrule(l){2-5}
& Top-1 & Change & Top-3 & Change \\\midrule
$B_{hl=1,AEP}$ & $\SI{23.8}{\percent}$ & $\SI{+0.6}{\percent}$ & $\SI{7.2}{\percent}$ & $\SI{+0.5}{\percent}$\\
$B_{hl=2,AEP}$ & \underline{$\SI{22.8}{\percent}$} & $\SI{+1.2}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{+0.7}{\percent}$\\
$B_{hl=3,AEP}$ & $\SI{23.1}{\percent}$ & $\SI{+1.2}{\percent}$ & \underline{$\SI{6.1}{\percent}$} & $\SI{+0.4}{\percent}$\\
$B_{hl=4,AEP}$ & $\SI{25.6}{\percent}$ & $\SI{+1.7}{\percent}$ & $\SI{7.0}{\percent}$ & $\SI{+0.8}{\percent}$\\
\bottomrule
\end{tabular}
\caption{Systems with denoising \acrfull{AEP} compared to pure
gradient descent. The \gls{AEP} systems performed worse.}
\label{table:pretraining-denoising-auto-encoder}
\end{table}