%!TEX root = write-math-ba-paper.tex
\section{Optimization of System Design}\label{ch:Optimization-of-System-Design}
In order to evaluate the effect of different preprocessing algorithms, features
and adjustments in the \gls{MLP} training and topology, the following baseline
system was used:
Scale the recording to fit into a unit square while keeping the aspect ratio,
shift it as described in \cref{sec:preprocessing}, and
resample it with linear interpolation to get 20~points per stroke, spaced
evenly in time. Take the first 4~strokes with 20~points per stroke and
2~coordinates per point as features, resulting in 160~features, which is equal
to the number of input neurons. If a recording has fewer than 4~strokes, the
remaining features are filled with zeroes.
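
The following minimal sketch illustrates this feature extraction. It is an
illustration only: the point format (dictionaries with \texttt{x}, \texttt{y}
and \texttt{time} entries, with increasing timestamps) is an assumption, not
necessarily the data format of the implementation.

\begin{verbatim}
import numpy as np

def stroke_features(strokes, max_strokes=4, points_per_stroke=20):
    """Resample each stroke to a fixed number of points, evenly
    spaced in time, and interleave the x and y coordinates."""
    features = np.zeros(max_strokes * points_per_stroke * 2)
    for i, stroke in enumerate(strokes[:max_strokes]):
        t = np.array([p['time'] for p in stroke], dtype=float)
        x = np.array([p['x'] for p in stroke], dtype=float)
        y = np.array([p['y'] for p in stroke], dtype=float)
        t_new = np.linspace(t[0], t[-1], points_per_stroke)
        offset = 2 * points_per_stroke * i
        end = offset + 2 * points_per_stroke
        features[offset:end:2] = np.interp(t_new, t, x)
        features[offset + 1:end:2] = np.interp(t_new, t, y)
    return features  # recordings with fewer strokes stay zero-padded
\end{verbatim}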
All experiments were evaluated with four baseline systems $B_{hl=i}$, $i \in \Set{1,
2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
could have a severe influence on the effect of new features or preprocessing
steps. Each hidden layer in all evaluated systems has $500$ neurons.
Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
the sigmoid function; the neurons in the
output layer use the softmax function. For every experiment, exactly one part
of the baseline systems was changed.
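
The corresponding forward pass can be sketched as follows (a simplified
illustration, not the implementation used for the experiments):

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """Sigmoid hidden layers followed by a softmax output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)
    return softmax(weights[-1] @ a + biases[-1])
\end{verbatim}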
\subsection{Random Weight Initialization}
The neural networks in all experiments were initialized with small random
weights
\[w_{i,j} \sim U\left(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}},\; 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}\right)\]
where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
initialization was suggested in
\cite{deeplearningweights} and is done to break symmetry.
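
A sketch of this initialization for one layer, assuming the weights are stored
as matrices of shape $n_{l+1} \times n_l$ as in the forward-pass sketch above:

\begin{verbatim}
import numpy as np

def init_layer_weights(n_l, n_lplus1, rng=np.random.default_rng()):
    """Draw weights uniformly from the interval given above
    (the factor 4 is the recommendation for sigmoid units)."""
    bound = 4.0 * np.sqrt(6.0 / (n_l + n_lplus1))
    return rng.uniform(-bound, bound, size=(n_lplus1, n_l))
\end{verbatim}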
This can lead to different error rates for the same systems just because the
initialization was different.
In order to get an impression of the magnitude of this influence on the
different topologies and error rates, the baseline models were trained 5~times
with different random initializations.
\Cref{table:baseline-systems-random-initializations-summary}
shows a summary of the results. The more hidden layers are used, the more
the results vary between different random weight initializations.
\begin{table}[h]
\centering
\begin{tabular}{crrr|rrr} %chktex 44
\toprule
\multirow{3}{*}{System} & \multicolumn{6}{c}{Classification error}\\
\cmidrule(l){2-7}
 & \multicolumn{3}{c}{Top-1} & \multicolumn{3}{c}{Top-3}\\
 & Min & Max & Mean & Min & Max & Mean\\\midrule
$B_{hl=1}$ & $\SI{23.1}{\percent}$ & $\SI{23.4}{\percent}$ & $\SI{23.2}{\percent}$ & $\SI{6.7}{\percent}$ & $\SI{6.8}{\percent}$ & $\SI{6.7}{\percent}$ \\
$B_{hl=2}$ & \underline{$\SI{21.4}{\percent}$} & \underline{$\SI{21.8}{\percent}$}& \underline{$\SI{21.6}{\percent}$} & $\SI{5.7}{\percent}$ & \underline{$\SI{5.8}{\percent}$} & \underline{$\SI{5.7}{\percent}$}\\
$B_{hl=3}$ & $\SI{21.5}{\percent}$ & $\SI{22.3}{\percent}$ & $\SI{21.9}{\percent}$ & \underline{$\SI{5.5}{\percent}$} & $\SI{5.8}{\percent}$ & \underline{$\SI{5.7}{\percent}$}\\
$B_{hl=4}$ & $\SI{23.2}{\percent}$ & $\SI{24.8}{\percent}$ & $\SI{23.9}{\percent}$ & $\SI{6.0}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{6.2}{\percent}$\\
\bottomrule
\end{tabular}
\caption{The systems $B_{hl=1}$ -- $B_{hl=4}$ were randomly initialized,
trained and evaluated 5~times to estimate the influence of random
weight initialization.}
\label{table:baseline-systems-random-initializations-summary}
\end{table}
\subsection{Stroke Connection}
In order to solve the problem of interrupted strokes, pairs of strokes
can be connected with a stroke connection algorithm. The idea is that for
a pair of consecutively drawn strokes $s_{i}, s_{i+1}$, the last point of $s_i$
is close to the first point of $s_{i+1}$ if a stroke was accidentally split
into two strokes.
$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence, to avoid connecting
strokes that were drawn separately on purpose, the stroke connection algorithm
was evaluated with thresholds below that range: $\SI{5}{\pixel}$,
$\SI{10}{\pixel}$ and $\SI{20}{\pixel}$.
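
A minimal sketch of this algorithm, using the same point format as above:

\begin{verbatim}
import math

def connect_strokes(strokes, theta):
    """Merge consecutive strokes whose gap is smaller than theta."""
    if not strokes:
        return []
    connected = [list(strokes[0])]
    for stroke in strokes[1:]:
        last, first = connected[-1][-1], stroke[0]
        gap = math.hypot(first['x'] - last['x'],
                         first['y'] - last['y'])
        if gap < theta:
            connected[-1].extend(stroke)  # treat the pair as one stroke
        else:
            connected.append(list(stroke))
    return connected
\end{verbatim}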
All models' top-3 error improved with a threshold of $\theta = \SI{10}{\pixel}$
by at least $\num{0.2}$~percentage points, except for $B_{hl=4}$, which did not
notably improve.
\subsection{Douglas-Peucker Smoothing}
The Douglas-Peucker algorithm was applied with thresholds of $\varepsilon =
0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
but before resampling. The interpolation in the resampling step was done
linearly in one set of experiments and with cubic splines in another. The
recording was scaled and shifted again after the interpolation because the
bounding box might have changed.
Applying Douglas-Peucker smoothing with $\varepsilon
> 0.05$ resulted in a sharp rise of the top-1 and top-3 error for all models
$B_{hl=i}$.
This means that the simplification process removes some relevant information
and does not---as was expected---remove only noise. For $\varepsilon = 0.05$
with linear interpolation, the top-1 error of some models improved, but the
changes were small and could be an effect of random weight initialization.
However, cubic spline interpolation made all systems perform more than
$\num{1.7}$~percentage points worse in top-1 and top-3 error.
The lower the value of $\varepsilon$, the less the recording changes in
this preprocessing step. As the algorithm was applied after scaling the
recording such that the biggest dimension (width or height) is $1$, a value of
$\varepsilon = 0.05$ means that a point is only kept if it deviates by at
least $\SI{5}{\percent}$ of the biggest dimension from the line segment
approximating its part of the stroke.
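
A sketch of the algorithm itself (recursive formulation; the helper for the
point-to-line distance is included for completeness):

\begin{verbatim}
import math

def _distance(p, a, b):
    """Perpendicular distance of point p to the line through a and b."""
    dx, dy = b['x'] - a['x'], b['y'] - a['y']
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(p['x'] - a['x'], p['y'] - a['y'])
    return abs(dy * (p['x'] - a['x']) - dx * (p['y'] - a['y'])) / norm

def douglas_peucker(points, epsilon):
    """Drop points that deviate less than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    d_max, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _distance(points[i], points[0], points[-1])
        if d > d_max:
            d_max, index = d, i
    if d_max < epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point
\end{verbatim}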
\subsection{Global Features}
Single global features were added one at a time to the baseline systems. Those
features were the re-curvature
$\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
described in \cite{Huang06}, the ink feature, which is the summed length
of all strokes, the stroke count, the aspect ratio, and the stroke center
points of the first four strokes. The stroke center point feature improved the
system $B_{hl=1}$ by $\num{0.3}$~percentage points for the top-3 error and the
system $B_{hl=3}$ by $\num{0.7}$~percentage points for the top-1 error, but all
other systems and error measures either got worse or did not improve much.
The other global features improved the systems $B_{hl=1}$ -- $B_{hl=3}$, but
not $B_{hl=4}$. The highest improvement was achieved with the re-curvature
feature: it improved the top-1 error of the systems $B_{hl=1}$ -- $B_{hl=4}$
by more than $\num{0.6}$~percentage points.
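
Sketches of the re-curvature and ink features, using the point format from
above:

\begin{verbatim}
import math

def length(stroke):
    """Summed Euclidean distance between consecutive points."""
    return sum(math.hypot(q['x'] - p['x'], q['y'] - p['y'])
               for p, q in zip(stroke, stroke[1:]))

def re_curvature(stroke):
    """height(stroke) / length(stroke), as defined above."""
    ys = [p['y'] for p in stroke]
    stroke_length = length(stroke)
    return (max(ys) - min(ys)) / stroke_length if stroke_length else 0.0

def ink(strokes):
    """Ink feature: the summed length of all strokes."""
    return sum(length(s) for s in strokes)
\end{verbatim}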
\subsection{Data Multiplication}
Data multiplication can be used to make the model invariant to
transformations. However, this idea does not seem to work well in the domain
of on-line handwritten mathematical symbols. We tripled the data by adding a
version that is rotated 3~degrees to the left and another one that is rotated
3~degrees to the right around the center of mass. This data multiplication
made all classifiers perform worse for most error measures, with the top-1
error increasing by more than $\num{2}$~percentage points.
The same experiment was executed with rotations of 6~degrees and, in another
experiment, of 9~degrees, but those performed even worse.
Multiplying the data by a factor of 5 by adding two 3-degree rotated
variants and two 6-degree rotated variants also made the classifiers perform
worse by more than $\num{2}$~percentage points.
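
A sketch of the rotation around the center of mass (angles in degrees;
positive angles rotate counterclockwise):

\begin{verbatim}
import math

def rotated(strokes, degrees):
    """Rotate every point around the center of mass of the recording."""
    points = [p for stroke in strokes for p in stroke]
    cx = sum(p['x'] for p in points) / len(points)
    cy = sum(p['y'] for p in points) / len(points)
    phi = math.radians(degrees)
    c, s = math.cos(phi), math.sin(phi)
    return [[dict(p,
                  x=cx + c * (p['x'] - cx) - s * (p['y'] - cy),
                  y=cy + s * (p['x'] - cx) + c * (p['y'] - cy))
             for p in stroke]
            for stroke in strokes]

def tripled(recordings, degrees=3):
    """Add one left-rotated and one right-rotated copy per recording."""
    out = []
    for strokes in recordings:
        out += [strokes, rotated(strokes, -degrees),
                rotated(strokes, degrees)]
    return out
\end{verbatim}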
\subsection{Pretraining}\label{subsec:pretraining-evaluation}
Pretraining is a technique used to improve the training of \glspl{MLP} with
multiple hidden layers.
\Cref{table:pretraining-slp} shows that \gls{SLP} improves the classification
performance by $\num{1.6}$~percentage points for the top-1 error and
$\num{1.0}$~percentage points for the top-3 error. As one can see in
\cref{fig:training-and-test-error-for-different-topologies-pretraining}, this
is not only a consequence of the longer effective training, as the test error
is relatively stable after $\num{1000}$ epochs of training. This was confirmed
by an experiment in which the baseline systems were trained for $\num{10000}$
epochs and did not perform notably differently.
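
One way to realize such layer-wise pretraining is to train the network, insert
a fresh hidden layer directly before the output layer, and train again. The
sketch below shows only the growing step, using the list-of-matrices
representation from the forward-pass sketch above; re-initializing the output
layer and the exact schedule are assumptions for illustration:

\begin{verbatim}
import numpy as np

def grow_network(weights, biases, n_new, rng=np.random.default_rng()):
    """Insert a new hidden layer with n_new neurons before the output
    layer; the already trained hidden layers are kept unchanged."""
    def uniform(n_in, n_out):
        bound = 4.0 * np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-bound, bound, size=(n_out, n_in))
    n_prev = weights[-1].shape[1]  # fan-in of the old output layer
    n_out = weights[-1].shape[0]   # number of classes
    weights = weights[:-1] + [uniform(n_prev, n_new),
                              uniform(n_new, n_out)]
    biases = biases[:-1] + [np.zeros(n_new), np.zeros(n_out)]
    return weights, biases         # train again after growing
\end{verbatim}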
\begin{figure}[htb]
\centering
\input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
\caption{Training and test error by number of trained epochs for different
topologies with \acrfull{SLP}. The plot shows
that all pretrained systems performed much better than the systems
without pretraining. None of the plotted systems improved
with more epochs of training.}
\label{fig:training-and-test-error-for-different-topologies-pretraining}
\end{figure}
\begin{table}[tb]
\centering
\begin{tabular}{lrrrr}
\toprule
\multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
\cmidrule(l){2-5}
 & Top-1 & Change & Top-3 & Change \\\midrule
$B_{hl=1}$ & $\SI{23.2}{\percent}$ & - & $\SI{6.7}{\percent}$ & - \\
$B_{hl=2,SLP}$ & $\SI{19.9}{\percent}$ & $\SI{-1.7}{\percent}$ & $\SI{4.7}{\percent}$ & $\SI{-1.0}{\percent}$\\
$B_{hl=3,SLP}$ & \underline{$\SI{19.4}{\percent}$} & $\SI{-2.5}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.1}{\percent}$\\
$B_{hl=4,SLP}$ & $\SI{19.6}{\percent}$ & $\SI{-4.3}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.6}{\percent}$\\
\bottomrule
\end{tabular}
\caption{Systems with 1--4 hidden layers which used \acrfull{SLP},
compared to the mean of the systems $B_{hl=1}$--$B_{hl=4}$ displayed
in \cref{table:baseline-systems-random-initializations-summary},
which used pure gradient descent. The change columns are given in
percentage points. The \gls{SLP}
systems clearly performed better.}
\label{table:pretraining-slp}
\end{table}
Pretraining with a denoising auto-encoder led to the much worse results listed
in \cref{table:pretraining-denoising-auto-encoder}. The first layer used a
$\tanh$ activation function. Every layer was trained for $1000$ epochs with
the \gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption
of $\varkappa = 0.3$ and an $L_2$ regularization of $\lambda = 10^{-4}$ were
chosen. This pretraining setup made all systems perform much worse on all
error measures.
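
The following sketch shows how one such pretraining step for a single layer
could look. Mask corruption (setting inputs to zero with probability
$\varkappa$), tied weights, a $\tanh$ reconstruction layer and full-batch
gradient descent are assumptions for illustration; only the hyperparameters
are taken from the text above:

\begin{verbatim}
import numpy as np

def pretrain_layer(X, n_hidden, epochs=1000, eta=0.001,
                   corruption=0.3, l2=1e-4,
                   rng=np.random.default_rng()):
    """Train one denoising auto-encoder layer (tanh units, MSE loss,
    L2 regularization) and return the learned encoder parameters."""
    n_in = X.shape[1]
    bound = np.sqrt(6.0 / (n_in + n_hidden))
    W = rng.uniform(-bound, bound, size=(n_in, n_hidden))
    b_h, b_o = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        X_noisy = X * (rng.random(X.shape) > corruption)  # corrupt input
        H = np.tanh(X_noisy @ W + b_h)                    # encode
        R = np.tanh(H @ W.T + b_o)                        # decode
        dR = (R - X) * (1.0 - R ** 2)                     # MSE gradient
        dH = (dR @ W) * (1.0 - H ** 2)
        grad_W = X_noisy.T @ dH + dR.T @ H + l2 * W
        W -= eta * grad_W / len(X)
        b_h -= eta * dH.mean(axis=0)
        b_o -= eta * dR.mean(axis=0)
    return W, b_h  # used to initialize the corresponding MLP layer
\end{verbatim}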
\begin{table}[tb]
\centering
\begin{tabular}{lrrrr}
\toprule
\multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
\cmidrule(l){2-5}
 & Top-1 & Change & Top-3 & Change \\\midrule
$B_{hl=1,AEP}$ & $\SI{23.8}{\percent}$ & $\SI{+0.6}{\percent}$ & $\SI{7.2}{\percent}$ & $\SI{+0.5}{\percent}$\\
$B_{hl=2,AEP}$ & \underline{$\SI{22.8}{\percent}$} & $\SI{+1.2}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{+0.7}{\percent}$\\
$B_{hl=3,AEP}$ & $\SI{23.1}{\percent}$ & $\SI{+1.2}{\percent}$ & \underline{$\SI{6.1}{\percent}$} & $\SI{+0.4}{\percent}$\\
$B_{hl=4,AEP}$ & $\SI{25.6}{\percent}$ & $\SI{+1.7}{\percent}$ & $\SI{7.0}{\percent}$ & $\SI{+0.8}{\percent}$\\
\bottomrule
\end{tabular}
\caption{Systems with denoising \acrfull{AEP} compared to pure
gradient descent. The \gls{AEP} systems performed worse.}
\label{table:pretraining-denoising-auto-encoder}
\end{table}