
Add papers/write-math-paper

Martin Thoma 9 years ago
parent commit fe78311901
25 changed files with 10,624 additions and 0 deletions
  1. +15 -0     documents/papers/write-math-paper/Makefile
  2. +11 -0     documents/papers/write-math-paper/README.md
  3. +17 -0     documents/papers/write-math-paper/abstract-500-chars.txt
  4. +1001 -0   documents/papers/write-math-paper/baseline-1.csv
  5. +1001 -0   documents/papers/write-math-paper/baseline-2-pretraining.csv
  6. +1001 -0   documents/papers/write-math-paper/baseline-2.csv
  7. +46 -0     documents/papers/write-math-paper/ch1-introduction.tex
  8. +36 -0     documents/papers/write-math-paper/ch2-general-system-design.tex
  9. +20 -0     documents/papers/write-math-paper/ch3-data-and-implementation.tex
 10. +113 -0    documents/papers/write-math-paper/ch4-algorithms.tex
 11. +214 -0    documents/papers/write-math-paper/ch5-optimization-of-system-design.tex
 12. +123 -0    documents/papers/write-math-paper/ch6-summary.tex
 13. +32 -0     documents/papers/write-math-paper/ch7-mfrdb-eval.tex
 14. +35 -0     documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/Makefile
 15. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-1.csv
 16. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2-pretraining.csv
 17. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2.csv
 18. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-3-pretraining.csv
 19. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-4-pretraining.csv
 20. +31 -0     documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex
 21. +74 -0     documents/papers/write-math-paper/glossary.tex
 22. BIN        documents/papers/write-math-paper/sRGBIEC1966-2.1.icm
 23. +12 -0     documents/papers/write-math-paper/variables.tex
 24. +1750 -0   documents/papers/write-math-paper/write-math-ba-paper.bib
 25. +87 -0     documents/papers/write-math-paper/write-math-ba-paper.tex

+ 15 - 0
documents/papers/write-math-paper/Makefile

@@ -0,0 +1,15 @@
+DOKUMENT = write-math-ba-paper
+make:
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # aux-files for makeindex / makeglossaries
+	makeglossaries $(DOKUMENT)
+	bibtex $(DOKUMENT)
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include index
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include symbol table
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include symbol table
+	make clean
+
+combine:
+	pdftk Dok1-open-kit.pdf KIT_SWP_Vorlage_Impressum_en_2015.pdf write-math-ba-paper.pdf KIT-WSP_RS_en.pdf cat output single-symbol-classification-paper.pdf
+
+clean:
+	rm -rf  $(TARGET) *.class *.html *.log *.aux *.out *.thm *.idx *.toc *.ind *.ilg figures/torus.tex *.glg *.glo *.gls *.ist *.xdy *.fdb_latexmk *.bak *.blg *.bbl *.glsdefs *.acn *.acr *.alg *.nls *.nlo *.bak *.pyg *.lot *.lof

+ 11 - 0
documents/papers/write-math-paper/README.md

@@ -0,0 +1,11 @@
+[Download compiled PDF](https://github.com/MartinThoma/write-math-paper/blob/master/write-math-ba-paper.pdf?raw=true)
+
+## License
+
+This work is licensed under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/).
+
+## Spell checking
+* Spell checking `for f in ch*.tex; do aspell --lang=en --mode=tex check $f; done`
+* Spell checking `for f in ch*.tex; do /home/moose/GitHub/Academic-Writing-Check/checkwriting $f; done`
+* Spell checking with `http://www.reverso.net/spell-checker`
+* https://github.com/devd/Academic-Writing-Check

+ 17 - 0
documents/papers/write-math-paper/abstract-500-chars.txt

@@ -0,0 +1,17 @@
+Autoren: Thoma, Martin; Kilgour, Kevin; Stüker, Sebastian; Waibel, Alexander
+Titel: On-line Recognition of Handwritten Mathematical Symbols
+Institut: Institute for Anthropomatics and Robotics
+
+Abstract (max 500 Zeichen):
+This paper presents a classification system which uses the pen trajectory to
+classify handwritten symbols. Five preprocessing steps, one data multiplication
+algorithm, five features and five variants for multilayer Perceptron training
+were evaluated using $\num{166898}$ recordings. The evaluation results of
+21~experiments were used to create an optimized recognizer. This improvement
+was achieved by \acrlong{SLP} and adding new features.
+
+Keywords (max 5): recognition; machine learning; neural networks; symbols;
+multilayer perceptron
+
+Geplanter Veröffentlichungstermin: 1. August 2015
+

The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/baseline-1.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/baseline-2-pretraining.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/baseline-2.csv


+ 46 - 0
documents/papers/write-math-paper/ch1-introduction.tex

@@ -0,0 +1,46 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Introduction}
+On-line recognition makes use of the pen trajectory. One possible
+representation of the data is given as groups of sequences of tuples $(x, y, t)
+\in \mathbb{R}^3$, where each group represents a stroke, $(x, y)$ is the
+position of the pen on a canvas and $t$ is the time.
+
+% On-line data was used to classify handwritten natural language text in many
+% different variants. For example, the $\text{NPen}^{++}$ system classified
+% cursive handwriting into English words by using hidden Markov models and neural
+% networks~\cite{Manke1995}.
+
+% Several systems for mathematical symbol recognition with on-line data have been
+% described so far~\cite{Kosmala98,Mouchere2013}, but no standard test set
+% existed to compare the results of different classifiers for single-symbol
+% classification of mathematical symbols. The used symbols differed in most
+% papers. This is unfortunate as the choice of symbols is crucial for the top-$n$
+% error. For example, the symbols $o$, $O$, $\circ$ and $0$ are very similar and
+% systems which know all those classes will certainly have a higher top-$n$ error
+% than systems which only accept one of them. But not only the classes differed,
+% also the used data to train and test had to be collected by each author again.
+
+\cite{Kirsch}~describes a system called Detexify which uses
+time warping to classify on-line handwritten symbols and reports a top-3 error
+of less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols. He also
+recently published his data on \url{https://github.com/kirel/detexify-data},
+which was collected by a crowdsourcing approach via
+\url{http://detexify.kirelabs.org}. Those recordings as well as some recordings
+which were collected by a similar approach via \url{http://write-math.com} were
+merged into a single data set, the labels were semi-automatically checked for
+correctness and used to train and evaluate different classifiers. A more
+detailed description of all used software, data and experiments is given
+in~\cite{Thoma:2014}.
+
+In this paper we present a baseline system for the classification of on-line
+handwriting into $369$ classes of which some are very similar. An optimized
+classifier was developed which has a $\SI{29.7}{\percent}$ relative improvement
+of the top-3 error. This was achieved by using better features and \gls{SLP}.
+The absolute improvements compared to the baseline of those changes will also
+be shown.
+
+In the following, we will give a general overview of the system design, give
+information about the used data and implementation, describe the algorithms
+we used to classify the data, report results of our experiments and present
+the optimized recognizer we created.
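The tuple representation described in ch1-introduction.tex above maps naturally onto nested lists. The following sketch only illustrates that structure; the literal values and the helper function are made up for illustration and are not part of the hwrt package.

    # A recording is a group of strokes; each stroke is a sequence of
    # (x, y, t) tuples: pen position on the canvas plus a timestamp.
    recording = [
        [   # first stroke
            (0.12, 0.30, 0.000),
            (0.15, 0.34, 0.012),
            (0.19, 0.40, 0.025),
        ],
        [   # second stroke
            (0.50, 0.10, 0.480),
            (0.52, 0.18, 0.495),
        ],
    ]

    def bounding_box(recording):
        """Return (min_x, min_y, max_x, max_y) over all strokes."""
        xs = [x for stroke in recording for (x, y, t) in stroke]
        ys = [y for stroke in recording for (x, y, t) in stroke]
        return min(xs), min(ys), max(xs), max(ys)

    print(bounding_box(recording))  # used e.g. by scale-and-shift later on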

+ 36 - 0
documents/papers/write-math-paper/ch2-general-system-design.tex

@@ -0,0 +1,36 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{General System Design}
+The following steps are used for symbol classification:\nobreak
+\begin{enumerate}
+    \item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
+          errors and people make mistakes while using the devices. To tackle
+          these problems there are preprocessing algorithms to clean the data.
+          The preprocessing algorithms can also remove unnecessary variations
+          of the data that do not help in the classification process, but hide
+          what is important. Having slightly different sizes of the same symbol
+          is an example of such a variation. Four preprocessing algorithms that
+          clean or normalize recordings are explained in
+          \cref{sec:preprocessing}.
+    \item \textbf{Data multiplication}: Learning systems need lots of data
+          to learn internal parameters. If there is not enough data available,
+          domain knowledge can be considered to create new artificial data from
+          the original data. In the domain of on-line handwriting recognition,
+          data can be multiplied by adding rotated variants.
+    \item \textbf{Feature extraction}: A feature is high-level information
+          derived from the raw data after preprocessing. Some systems like
+          Detexify take the result of the preprocessing step, but many compute
+          new features. Those features can be designed by a human engineer or
+          learned. Non-raw data features have the advantage that less
+          training data is needed since the developer uses knowledge about
+          handwriting to compute highly discriminative features. Various
+          features are explained in \cref{sec:features}.
+\end{enumerate}
+
+After these steps, it is a classification task for which the classifier has to
+learn internal parameters before it can classify new recordings. We classified
+recordings by computing constant-sized feature vectors and using
+\glspl{MLP}. There are many ways to adjust \glspl{MLP} (number of neurons and
+layers, activation functions) and their training (learning rate, momentum,
+error function). Some of them are described in~\cref{sec:mlp-training} and the
+evaluation results are presented in \cref{ch:Optimization-of-System-Design}.
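The preparation steps and the final classification step from ch2-general-system-design.tex can be summarized as a small pipeline. This is only a structural sketch under the assumption that each step is a plain function; all names below are placeholders, not the hwrt API.

    # Structural sketch of the recognition pipeline described above.
    # Every name here is a placeholder used for illustration only.

    def classify(recording, preprocessors, feature_extractors, mlp):
        # 1. Preprocessing: clean and normalize the raw pen trajectory.
        for preprocess in preprocessors:
            recording = preprocess(recording)
        # 2. Feature extraction: build a constant-sized feature vector.
        features = []
        for extract in feature_extractors:
            features.extend(extract(recording))
        # 3. Classification: the MLP yields a score per symbol class.
        return mlp.predict([features])

    # Data multiplication (e.g. adding rotated copies of each recording)
    # is a training-time step and therefore does not appear in this
    # inference sketch.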

+ 20 - 0
documents/papers/write-math-paper/ch3-data-and-implementation.tex

@@ -0,0 +1,20 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Data and Implementation}
+We used $\num{369}$ symbol classes with a total of $\num{166898}$ labeled
+recordings. Each class has at least $\num{50}$ labeled recordings, but over
+$200$ symbols have more than $\num{200}$ labeled recordings and over $100$
+symbols have more than $500$ labeled recordings.
+The data was collected by two crowd-sourcing projects (Detexify and
+\href{http://write-math.com}{write-math.com}) where users wrote
+symbols, were then given a list ordered by an early classification system and
+clicked on the symbol they wrote.
+
+The data of Detexify and \href{http://write-math.com}{write-math.com} was
+combined, filtered semi-automatically and can be downloaded via
+\href{http://write-math.com/data}{write-math.com/data} as a compressed tar
+archive of CSV files.
+
+All of the following preprocessing and feature computation algorithms were
+implemented and are publicly available as open-source software in the Python
+package \texttt{hwrt}.
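Since the merged data set is distributed as a compressed tar archive of CSV files, it can be walked with nothing but the Python standard library. The archive file name and the column names in this sketch are assumptions made for illustration; the real download may use different names.

    import csv
    import io
    import tarfile

    ARCHIVE = "write-math-data.tar.bz2"  # assumed file name

    with tarfile.open(ARCHIVE, "r:*") as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".csv"):
                continue
            f = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
            for row in csv.DictReader(f):
                # Hypothetical column names -- check the real CSV header.
                label = row.get("symbol")
                strokes = row.get("data")
                # ... hand the recording to the preprocessing pipeline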

+ 113 - 0
documents/papers/write-math-paper/ch4-algorithms.tex

@@ -0,0 +1,113 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Algorithms}
+\subsection{Preprocessing}\label{sec:preprocessing}
+Preprocessing in symbol recognition is done to improve the quality and
+expressive power of the data. It makes follow-up tasks like feature extraction
+and classification easier, more effective or faster. It does so by resolving
+errors in the input data, reducing duplicate information and removing
+irrelevant information.
+
+Preprocessing algorithms fall into two groups: Normalization and noise
+reduction algorithms.
+
+A very important normalization algorithm in single-symbol recognition is
+\textit{scale-and-shift}~\cite{Thoma:2014}. It scales the recording so that
+its bounding box fits into a unit square. As the aspect ratio of a recording is
+almost never 1:1, only one dimension will fit exactly in the unit square. For
+this paper, it was chosen to shift the recording in the direction of its bigger
+dimension into the $[0,1] \times [0,1]$ unit square. After that, the recording
+is shifted in the direction of its smaller dimension such that its bounding box is
+centered around zero.
+
+Another normalization preprocessing algorithm is
+resampling~\cite{Guyon91,Manke01}. As the data points on the pen trajectory are
+generated asynchronously and with different time-resolutions depending on the
+used hardware and software, it is desirable to resample the recordings to have
+points spread equally in time for every recording. This was done by linear
+interpolation of the $(x,t)$ and $(y,t)$ sequences and getting a fixed number
+of equally spaced points per stroke.
+
+\textit{Stroke connection} is a noise reduction algorithm which is mentioned
+in~\cite{Tappert90}. It happens sometimes that the hardware detects that the
+user lifted the pen where the user certainly didn't do so. This can be detected
+by measuring the Euclidean distance between the end of one stroke and the
+beginning of the next stroke. If this distance is below a threshold, then the
+strokes are connected.
+
+Due to a limited resolution of the recording device and due to erratic
+handwriting, the pen trajectory might not be smooth. One way to smooth is
+calculating a weighted average and replacing points by the weighted average of
+their coordinates and their neighbors' coordinates. Another way to do smoothing
+is to reduce the number of points with the Douglas-Peucker
+algorithm to the points that are more relevant for the
+overall shape of a stroke and then interpolate the stroke between those points.
+The Douglas-Peucker stroke simplification algorithm is usually used in
+cartography to simplify the shape of roads. It works recursively to find a
+subset of points of a stroke that is simpler and still similar to the original
+shape. The algorithm adds the first and the last point $p_1$ and $p_n$ of a
+stroke to the simplified set of points $S$. Then it searches the point $p_i$ in
+between that has maximum distance from the line $p_1 p_n$. If this distance is
+above a threshold $\varepsilon$, the point $p_i$ is added to $S$. Then the
+algorithm gets applied to $p_1 p_i$ and $p_i p_n$ recursively. It is described
+as \enquote{Algorithm 1} in~\cite{Visvalingam1990}.
+
+\subsection{Features}\label{sec:features}
+Features can be \textit{global}, that means calculated for the complete
+recording or complete strokes. Other features are calculated for single points
+on the pen trajectory and are called \textit{local}.
+
+Global features are the \textit{number of strokes} in a recording, the
+\textit{aspect ratio} of a recording's bounding box or the
+\textit{ink} being used for a recording. The ink feature gets calculated by
+measuring the length of all strokes combined. The re-curvature, which was
+introduced in~\cite{Huang06}, is defined as
+\[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
+and is a stroke-global feature.
+
+The simplest local feature is the coordinate of the point itself. Speed,
+curvature and a local small-resolution bitmap around the point, which was
+introduced by Manke, Finke and Waibel in~\cite{Manke1995}, are other local
+features.
+
+\subsection{Multilayer Perceptrons}\label{sec:mlp-training}
+\Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
+different numbers of hidden layers; the number of neurons per layer and the
+activation functions can be varied. The learning algorithm is parameterized by
+the learning rate $\eta \in (0, \infty)$, the momentum $\alpha \in [0, \infty)$
+and the number of epochs.
+
+The topology of \glspl{MLP} will be denoted in the following by separating the
+number of neurons per layer with colons. For example, the notation
+$160{:}500{:}500{:}500{:}369$ means that the input layer gets 160~features,
+there are three hidden layers with 500~neurons per layer and one output layer
+with 369~neurons.
+
+\glspl{MLP} training can be executed in various different ways, for example
+with \acrfull{SLP}. In case of a \gls{MLP} with the topology
+$160{:}500{:}500{:}500{:}369$, \gls{SLP} works as follows: At first a \gls{MLP}
+with one hidden layer ($160{:}500{:}369$) is trained. Then the output layer is
+discarded, a new hidden layer and a new output layer are added and it is trained
+again, resulting in a $160{:}500{:}500{:}369$ \gls{MLP}. The output layer is
+discarded again, a new hidden layer and a new output layer are added,
+and the training is executed again.
+
+Denoising auto-encoders are another way of pretraining. An
+\textit{auto-encoder} is a neural network that is trained to restore its input.
+This means the number of input neurons is equal to the number of output
+neurons. The weights define an \textit{encoding} of the input that allows
+restoring the input. As the neural network finds the encoding by itself, it is
+called auto-encoder. If the hidden layer is smaller than the input layer, it
+can be used for dimensionality reduction~\cite{Hinton1989}. If only one hidden
+layer with linear activation functions is used, then the hidden layer contains
+the principal components after training~\cite{Duda2001}.
+
+Denoising auto-encoders are a variant introduced in~\cite{Vincent2008} that
+is more robust to partial corruption of the input features. It is trained to
+get robust by adding noise to the input features.
+
+There are multiple ways in which noise can be added. Gaussian noise and randomly
+masking elements with zero are two possibilities.
+\cite{Deeplearning-Denoising-AE} describes how such a denoising auto-encoder
+with masking noise can be implemented. The corruption $\varkappa \in [0, 1)$ is
+the probability of a feature being masked.
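Two of the preprocessing steps described above lend themselves to compact reference implementations: stroke connection (join consecutive strokes whose gap is below a threshold) and Douglas-Peucker simplification. The sketch below is an illustrative re-implementation of the stated definitions, not the hwrt code; points are assumed to be (x, y) pairs.

    import math

    def connect_strokes(strokes, threshold=10.0):
        """Join consecutive strokes whose endpoint gap is below threshold."""
        connected = [list(strokes[0])]
        for stroke in strokes[1:]:
            last_x, last_y = connected[-1][-1]
            first_x, first_y = stroke[0]
            if math.hypot(first_x - last_x, first_y - last_y) < threshold:
                connected[-1].extend(stroke)   # merge into the previous stroke
            else:
                connected.append(list(stroke))
        return connected

    def point_line_distance(p, a, b):
        """Distance of point p from the line through a and b."""
        (px, py), (ax, ay), (bx, by) = p, a, b
        if (ax, ay) == (bx, by):
            return math.hypot(px - ax, py - ay)
        cross = (bx - ax) * (ay - py) - (by - ay) * (ax - px)
        return abs(cross) / math.hypot(bx - ax, by - ay)

    def douglas_peucker(points, epsilon):
        """Recursive simplification: keep only points that deviate more
        than epsilon from the line between the kept endpoints."""
        if len(points) < 3:
            return list(points)
        index, dmax = 0, 0.0
        for i in range(1, len(points) - 1):
            d = point_line_distance(points[i], points[0], points[-1])
            if d > dmax:
                index, dmax = i, d
        if dmax <= epsilon:
            return [points[0], points[-1]]
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right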

+ 214 - 0
documents/papers/write-math-paper/ch5-optimization-of-system-design.tex

@@ -0,0 +1,214 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Optimization of System Design}\label{ch:Optimization-of-System-Design}
+In order to evaluate the effect of different preprocessing algorithms, features
+and adjustments in the \gls{MLP} training and topology, the following baseline
+system was used:
+
+Scale the recording to fit into a unit square while keeping the aspect ratio,
+shift it as described in \cref{sec:preprocessing},
+resample it with linear interpolation to get 20~points per stroke, spaced
+evenly in time. Take the first 4~strokes with 20~points per stroke and
+2~coordinates per point as features, resulting in 160~features which is equal
+to the number of input neurons. If a recording has fewer than 4~strokes, the
+remaining features are filled with zeroes.
+
+All experiments were evaluated with four baseline systems $B_{hl=i}$, $i \in \Set{1,
+2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
+could have a severe influence on the effect of new features or preprocessing
+steps. Each hidden layer in all evaluated systems has $500$ neurons.
+
+Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
+of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
+the sigmoid function. The neurons in the
+output layer use the softmax function. For every experiment, exactly one part
+of the baseline systems was changed.
+
+
+\subsection{Random Weight Initialization}
+The neural networks in all experiments got initialized with a small random
+weight
+
+\[w_{i,j} \sim U(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}, 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}})\]
+
+where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
+of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
+initialization was suggested in
+\cite{deeplearningweights} and is done to break symmetry.
+
+This can lead to different error rates for the same systems just because the
+initialization was different.
+
+In order to get an impression of the magnitude of the influence on the different
+topologies and error rates, the baseline models were trained 5 times with
+random initializations.
+\Cref{table:baseline-systems-random-initializations-summary}
+shows a summary of the results. The more hidden layers are used, the more
+the results vary between different random weight initializations.
+
+\begin{table}[h]
+    \centering
+    \begin{tabular}{crrr|rrr} %chktex 44
+    \toprule
+    \multirow{3}{*}{System}  & \multicolumn{6}{c}{Classification error}\\
+    \cmidrule(l){2-7}
+               & \multicolumn{3}{c}{Top-1}   & \multicolumn{3}{c}{Top-3}\\
+               & Min                   & Max                   & Mean                  & Min                  & Max                  & Mean\\\midrule
+    $B_{hl=1}$ & $\SI{23.1}{\percent}$ & $\SI{23.4}{\percent}$ & $\SI{23.2}{\percent}$ & $\SI{6.7}{\percent}$ & $\SI{6.8}{\percent}$ & $\SI{6.7}{\percent}$ \\
+    $B_{hl=2}$ & \underline{$\SI{21.4}{\percent}$} & \underline{$\SI{21.8}{\percent}$}& \underline{$\SI{21.6}{\percent}$} & $\SI{5.7}{\percent}$ & \underline{$\SI{5.8}{\percent}$} & \underline{$\SI{5.7}{\percent}$}\\
+    $B_{hl=3}$ & $\SI{21.5}{\percent}$ & $\SI{22.3}{\percent}$ & $\SI{21.9}{\percent}$ & \underline{$\SI{5.5}{\percent}$} & $\SI{5.8}{\percent}$ & \underline{$\SI{5.7}{\percent}$}\\
+    $B_{hl=4}$ & $\SI{23.2}{\percent}$ & $\SI{24.8}{\percent}$ & $\SI{23.9}{\percent}$ & $\SI{6.0}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{6.2}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{The systems $B_{hl=1}$ -- $B_{hl=4}$ were randomly initialized,
+             trained and evaluated 5~times to estimate the influence of random
+             weight initialization.}
+\label{table:baseline-systems-random-initializations-summary}
+\end{table}
+
+
+\subsection{Stroke connection}
+In order to solve the problem of interrupted strokes, pairs of strokes
+can be connected with a stroke connection algorithm. The idea is that for
+a pair of consecutively drawn strokes $s_{i}, s_{i+1}$ the last point of $s_i$ is
+close to the first point of $s_{i+1}$ if a stroke was accidentally split
+into two strokes.
+
+$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
+between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connection
+algorithm was evaluated with $\SI{5}{\pixel}$, $\SI{10}{\pixel}$ and
+$\SI{20}{\pixel}$.
+All models' top-3 error improved with a threshold of $\theta = \SI{10}{\pixel}$
+by at least $\num{0.2}$ percentage points, except $B_{hl=4}$ which did not notably
+improve.
+
+
+\subsection{Douglas-Peucker Smoothing}
+The Douglas-Peucker algorithm was applied with a threshold of $\varepsilon =
+0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
+but before resampling. The interpolation in the resampling step was done
+linearly and with cubic splines in two experiments. The recording was scaled
+and shifted again after the interpolation because the bounding box might have
+changed.
+
+The result of the application of the Douglas-Peucker smoothing with $\varepsilon
+> 0.05$ was a high rise of the top-1 and top-3 error for all models $B_{hl=i}$.
+This means that the simplification process removes some relevant information and
+does not---as it was expected---remove only noise. For $\varepsilon = 0.05$
+with linear interpolation some models' top-1 error improved, but the
+changes were small. It could be an effect of random weight initialization.
+However, cubic spline interpolation made all systems perform more than
+$\num{1.7}$ percentage points worse for top-1 and top-3 error.
+
+The lower the value of $\varepsilon$, the less does the recording change after
+this preprocessing step. As it was applied after scaling the recording such that
+the biggest dimension of the recording (width or height) is $1$, a value of
+$\varepsilon = 0.05$ means that a point has to be at least $\SI{5}{\percent}$
+of the biggest dimension away from the approximating line segment to be kept.
+
+
+\subsection{Global Features}
+Single global features were added one at a time to the baseline systems. Those
+features were re-curvature
+$\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
+as described in \cite{Huang06}, the ink feature which is the summed length
+of all strokes, the stroke count, the aspect ratio and the stroke center points
+for the first four strokes. The stroke center point feature improved the system
+$B_{hl=1}$ by $\num{0.3}$~percentage points for the top-3 error and system $B_{hl=3}$ for
+the top-1 error by $\num{0.7}$~percentage points, but all other systems and
+error measures either got worse or did not improve much.
+
+The other global features did improve the systems $B_{hl=1}$ -- $B_{hl=3}$, but not
+$B_{hl=4}$. The highest improvement was achieved with the re-curvature feature. It
+improved the systems $B_{hl=1}$ -- $B_{hl=4}$ by more than $\num{0.6}$~percentage points
+top-1 error.
+
+
+\subsection{Data Multiplication}
+Data multiplication can be used to make the model invariant to transformations.
+However, this idea seems not to work well in the domain of on-line handwritten
+mathematical symbols. We tripled the data by adding a version that is rotated
+3~degrees to the left and another one that is rotated 3~degrees to the right
+around the center of mass. This data multiplication made all classifiers for
+most error measures perform worse by more than $\num{2}$~percentage points for
+the top-1 error.
+
+The same experiment was executed by rotating by 6~degrees and in another
+experiment by 9~degrees, but those performed even worse.
+
+Also multiplying the data by a factor of 5 by adding two 3-degree rotated
+variants and two 6-degree rotated variants made the classifier perform worse
+by more than $\num{2}$~percentage points.
+
+
+\subsection{Pretraining}\label{subsec:pretraining-evaluation}
+Pretraining is a technique used to improve the training of \glspl{MLP} with
+multiple hidden layers.
+
+\Cref{table:pretraining-slp} shows that \gls{SLP} improves the classification
+performance by $\num{1.6}$ percentage points for the top-1 error and
+$\num{1.0}$ percentage points for the top-3 error. As one can see in
+\cref{fig:training-and-test-error-for-different-topologies-pretraining}, this
+is not only the case because of the longer training as the test error is
+relatively stable after $\num{1000}$ epochs of training. This was confirmed
+by an experiment in which the baseline systems were trained for $\num{10000}$
+epochs and did not perform notably different.
+
+\begin{figure}[htb]
+    \centering
+    \input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
+    \caption{Training- and test error by number of trained epochs for different
+             topologies with \acrfull{SLP}. The plot shows
+             that all pretrained systems performed much better than the systems
+             without pretraining. None of the plotted systems improved
+             with more epochs of training.}
+\label{fig:training-and-test-error-for-different-topologies-pretraining}
+\end{figure}
+
+\begin{table}[tb]
+    \centering
+    \begin{tabular}{lrrrr}
+    \toprule
+    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
+    \cmidrule(l){2-5}
+                & Top-1                  & Change               & Top-3                & Change                 \\\midrule
+    $B_{hl=1}$     & $\SI{23.2}{\percent}$  & -                    & $\SI{6.7}{\percent}$ & - \\
+    $B_{hl=2,SLP}$ & $\SI{19.9}{\percent}$ & $\SI{-1.7}{\percent}$ & $\SI{4.7}{\percent}$ & $\SI{-1.0}{\percent}$\\
+    $B_{hl=3,SLP}$ & \underline{$\SI{19.4}{\percent}$} & $\SI{-2.5}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.1}{\percent}$\\
+    $B_{hl=4,SLP}$ & $\SI{19.6}{\percent}$ & $\SI{-4.3}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.6}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{Systems with 1--4 hidden layers which used \acrfull{SLP}
+             compared to the mean of systems $B_{hl=1}$--$B_{hl=4}$ displayed
+             in \cref{table:baseline-systems-random-initializations-summary}
+             which used pure gradient descent. The \gls{SLP}
+             systems clearly performed better.}
+\label{table:pretraining-slp}
+\end{table}
+
+
+Pretraining with denoising auto-encoders led to the much worse results listed in
+\cref{table:pretraining-denoising-auto-encoder}. The first layer used a $\tanh$
+activation function. Every layer was trained for $1000$ epochs with the
+\gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
+$\varkappa = 0.3$ and a $L_2$ regularization of $\lambda = 10^{-4}$ were
+chosen. This pretraining setup made all systems with all error measures perform
+much worse.
+
+\begin{table}[tb]
+    \centering
+    \begin{tabular}{lrrrr}
+    \toprule
+    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
+    \cmidrule(l){2-5}
+                 & Top-1                  & Change               & Top-3                & Change                 \\\midrule
+    $B_{hl=1,AEP}$ & $\SI{23.8}{\percent}$ & $\SI{+0.6}{\percent}$ & $\SI{7.2}{\percent}$ & $\SI{+0.5}{\percent}$\\
+    $B_{hl=2,AEP}$ & \underline{$\SI{22.8}{\percent}$} & $\SI{+1.2}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{+0.7}{\percent}$\\
+    $B_{hl=3,AEP}$ & $\SI{23.1}{\percent}$ & $\SI{+1.2}{\percent}$ & \underline{$\SI{6.1}{\percent}$} & $\SI{+0.4}{\percent}$\\
+    $B_{hl=4,AEP}$ & $\SI{25.6}{\percent}$ & $\SI{+1.7}{\percent}$ & $\SI{7.0}{\percent}$ & $\SI{+0.8}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{Systems with denoising \acrfull{AEP} compared to pure
+             gradient descent. The \gls{AEP} systems performed worse.}
+\label{table:pretraining-denoising-auto-encoder}
+\end{table}
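The supervised layer-wise pretraining schedule evaluated above (train a 160:500:369 net, discard the output layer, add a new hidden layer and a new output layer, retrain) can be illustrated with a generic neural-network library. The paper's own implementation belongs to the hwrt tooling; the Keras-based sketch below is purely an illustrative assumption, using the sigmoid/softmax activations, learning rate 0.1 and momentum 0.1 from the baseline description, and random placeholder data instead of the real recordings.

    import numpy as np
    from tensorflow import keras

    # Placeholder data with the paper's dimensions: 160 input features,
    # 369 symbol classes. Real training uses the write-math recordings.
    x_train = np.random.rand(1000, 160).astype("float32")
    y_train = keras.utils.to_categorical(
        np.random.randint(0, 369, size=1000), num_classes=369)

    def compile_and_fit(model, epochs=1000):
        # 1000 epochs per layer as in the paper; lower this for a quick test.
        model.compile(
            optimizer=keras.optimizers.SGD(learning_rate=0.1, momentum=0.1),
            loss="categorical_crossentropy")
        model.fit(x_train, y_train, epochs=epochs, batch_size=128, verbose=0)

    # Step 1: train a net with one hidden layer (160:500:369).
    model = keras.Sequential([
        keras.Input(shape=(160,)),
        keras.layers.Dense(500, activation="sigmoid"),
        keras.layers.Dense(369, activation="softmax"),
    ])
    compile_and_fit(model)

    # Steps 2 and 3: discard the output layer, add a new hidden layer and a
    # new output layer, and train again -- ending at 160:500:500:500:369.
    for _ in range(2):
        model.pop()                                   # drop the softmax layer
        model.add(keras.layers.Dense(500, activation="sigmoid"))
        model.add(keras.layers.Dense(369, activation="softmax"))
        compile_and_fit(model)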

+ 123 - 0
documents/papers/write-math-paper/ch6-summary.tex

@@ -0,0 +1,123 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Summary}
+Four baseline recognition systems were adjusted in many experiments and their
+recognition capabilities were compared in order to build a recognition system
+that can recognize 369 mathematical symbols with low error rates as well as to
+evaluate which preprocessing steps and features help to improve the recognition
+rate.
+
+All recognition systems were trained and evaluated with
+$\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
+symbols. These recordings were collected by two crowdsourcing projects
+(\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
+\href{http://write-math.com}{write-math.com}) and created with various devices. While
+some recordings were created with standard touch devices such as tablets and
+smartphones, others were created with the mouse.
+
+\Glspl{MLP} were used for the classification task. Four baseline systems with
+different numbers of hidden layers were used, as the number of hidden layers
+influences the capabilities and problems of \glspl{MLP}.
+
+All baseline systems used the same preprocessing queue. The recordings were
+scaled and shifted as described in \ref{sec:preprocessing}, resampled with
+linear interpolation so that every stroke had exactly 20~points which are
+spread equidistant in time. The 80~($x,y$) coordinates of the first 4~strokes
+were used to get exactly $160$ input features for every recording. The baseline
+system $B_{hl=2}$ has a top-3 error of $\SI{5.7}{\percent}$.
+
+Adding two slightly rotated variants for each recording and hence tripling the
+training set made the systems $B_{hl=3}$ and $B_{hl=4}$ perform much worse, but
+improved the performance of the smaller systems.
+
+The global features re-curvature, ink, stroke count and aspect ratio improved
+the systems $B_{hl=1}$--$B_{hl=3}$, whereas the stroke center point feature
+made $B_{hl=2}$ perform worse.
+
+Denoising auto-encoders were evaluated as one way to use pretraining, but
+this increased the error rate notably. However, \acrlong{SLP} improved the
+performance decidedly.
+
+The stroke connection algorithm was added to the preprocessing steps of the
+baseline system as well as the re-curvature feature, the ink feature, the
+number of strokes and the aspect ratio. The training setup of the baseline
+system was changed to \acrlong{SLP} and the resulting model was trained with a
+lower learning rate again. This optimized recognizer $B_{hl=2,c}'$ had a top-3
+error of $\SI{4.0}{\percent}$. This means that the top-3 error dropped by over
+$\num{1.7}$ percentage points in comparison to the baseline system $B_{hl=2}$.
+
+A top-3 error of $\SI{4.0}{\percent}$ makes the system usable for symbol
+lookup. It could also be used as a starting point for the development of a
+multiple-symbol classifier.
+
+The aim of this work was to develop a symbol recognition system which is easy
+to use, fast and has high recognition rates as well as evaluating ideas for
+single symbol classifiers. Some of those goals were reached. The recognition
+system $B_{hl=2,c}'$ evaluates new recordings in a fraction of a second and has
+acceptable recognition rates.
+
+% Many algorithms were evaluated. However, there are still many other
+% algorithms which could be evaluated and, at the time of this work, the best
+% classifier $B_{hl=2,c}'$ is only available through the Python package
+% \texttt{hwrt}. It is planned to add an web version of that classifier online.
+
+\section{Optimized Recognizer}
+All preprocessing steps and features that were useful were combined to create a
+recognizer that performs best.
+
+All models were much better than everything that was tried before. The results
+of this experiment show that single-symbol recognition with
+\totalClassesAnalyzed{} classes and usual touch devices and the mouse can be
+done with a top-1 error rate of $\SI{18.6}{\percent}$ and a top-3 error of
+$\SI{4.1}{\percent}$. This was
+achieved by a \gls{MLP} with a $167{:}500{:}500{:}\totalClassesAnalyzed{}$ topology.
+
+It used the stroke connection algorithm to connect strokes whose ends were less
+than $\SI{10}{\pixel}$ apart, scaled each recording to a unit square and shifted
+it as described in \ref{sec:preprocessing}. After that, a linear resampling step
+was applied to the first 4 strokes to resample them to 20 points each. All
+other strokes were discarded.
+
+\goodbreak
+The 167 features were\mynobreakpar%
+\begin{itemize}
+     \item the first 4 strokes with 20 points per stroke resulting in 160
+           features,
+     \item the re-curvature for the first 4 strokes,
+     \item the ink,
+     \item the number of strokes and
+     \item the aspect ratio of the bounding box
+\end{itemize}
+
+\Gls{SLP} was applied with $\num{1000}$ epochs per layer, a
+learning rate of $\eta=0.1$ and a momentum of $\alpha=0.1$. After that, the
+complete model was trained again for $1000$ epochs with standard mini-batch
+gradient descent resulting in systems $B_{hl=1,c}'$ -- $B_{hl=4,c}'$.
+
+After the models $B_{hl=1,c}$ -- $B_{hl=4,c}$ were trained for the first $1000$ epochs,
+they were trained again for $\num{1000}$ epochs with a learning rate of $\eta =
+0.05$. \Cref{table:complex-recognizer-systems-evaluation} shows that
+this improved the classifiers again.
+
+\begin{table}[htb]
+    \centering
+    \begin{tabular}{lrrrr}
+    \toprule
+    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
+    \cmidrule(l){2-5}
+              & Top-1                 & Change                & Top-3                & Change\\\midrule
+    $B_{hl=1,c}$ & $\SI{21.0}{\percent}$ & $\SI{-2.2}{\percent}$ & $\SI{5.2}{\percent}$ & $\SI{-1.5}{\percent}$\\
+    $B_{hl=2,c}$ & $\SI{18.3}{\percent}$ & $\SI{-3.3}{\percent}$ & $\SI{4.1}{\percent}$ & $\SI{-1.6}{\percent}$\\
+    $B_{hl=3,c}$ & \underline{$\SI{18.2}{\percent}$} & $\SI{-3.7}{\percent}$ & \underline{$\SI{4.1}{\percent}$} & $\SI{-1.6}{\percent}$\\
+    $B_{hl=4,c}$ & $\SI{18.6}{\percent}$ & $\SI{-5.3}{\percent}$ & $\SI{4.3}{\percent}$ & $\SI{-1.9}{\percent}$\\\midrule
+    $B_{hl=1,c}'$ & $\SI{19.3}{\percent}$ & $\SI{-3.9}{\percent}$ & $\SI{4.8}{\percent}$ & $\SI{-1.9}{\percent}$ \\
+    $B_{hl=2,c}'$ & \underline{$\SI{17.5}{\percent}$} & $\SI{-4.1}{\percent}$ & \underline{$\SI{4.0}{\percent}$} & $\SI{-1.7}{\percent}$\\
+    $B_{hl=3,c}'$ & $\SI{17.7}{\percent}$ & $\SI{-4.2}{\percent}$ & $\SI{4.1}{\percent}$ & $\SI{-1.6}{\percent}$\\
+    $B_{hl=4,c}'$ & $\SI{17.8}{\percent}$ & $\SI{-6.1}{\percent}$ & $\SI{4.3}{\percent}$ & $\SI{-1.9}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{Error rates of the optimized recognizer systems. The systems
+             $B_{hl=i,c}'$ were trained another $\num{1000}$ epochs with a learning rate
+             of $\eta=0.05$.}
+\label{table:complex-recognizer-systems-evaluation}
+\end{table}
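The 167 features listed above (160 point coordinates, four re-curvature values, ink, stroke count and the aspect ratio of the bounding box) can be assembled as in the sketch below. It assumes strokes that are already preprocessed lists of (x, y) points with 20 resampled points per stroke; zero padding for missing strokes mirrors the baseline description, and the aspect ratio being width divided by height is an assumption.

    import math

    def stroke_length(stroke):
        """Summed length of the line segments of one stroke."""
        return sum(math.hypot(x2 - x1, y2 - y1)
                   for (x1, y1), (x2, y2) in zip(stroke, stroke[1:]))

    def re_curvature(stroke):
        """height(stroke) / length(stroke) as defined in ch4-algorithms.tex."""
        ys = [y for _, y in stroke]
        length = stroke_length(stroke)
        return (max(ys) - min(ys)) / length if length else 0.0

    def feature_vector(strokes):
        """167 features: 4 strokes x 20 points x (x, y), 4 re-curvature
        values, ink, stroke count, aspect ratio of the bounding box."""
        features = []
        # 160 coordinates; missing strokes and points are zero padded.
        for i in range(4):
            stroke = strokes[i] if i < len(strokes) else []
            for j in range(20):
                x, y = stroke[j] if j < len(stroke) else (0.0, 0.0)
                features.extend([x, y])
        # Four re-curvature values (zero for missing strokes, an assumption).
        for i in range(4):
            features.append(re_curvature(strokes[i]) if i < len(strokes) else 0.0)
        # Ink: summed length of all strokes.
        features.append(sum(stroke_length(s) for s in strokes))
        # Number of strokes.
        features.append(float(len(strokes)))
        # Aspect ratio of the bounding box (width / height, an assumption).
        xs = [x for s in strokes for x, _ in s]
        ys = [y for s in strokes for _, y in s]
        height = (max(ys) - min(ys)) or 1.0
        features.append((max(xs) - min(xs)) / height)
        return features  # len(features) == 167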

+ 32 - 0
documents/papers/write-math-paper/ch7-mfrdb-eval.tex

@@ -0,0 +1,32 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Evaluation}
+
+The optimized classifier was evaluated on three publicly available data sets:
+\verb+MfrDB_Symbols_v1.0+ \cite{Stria2012}, CROHME~2011 \cite{Mouchere2011},
+and CROHME~2012 \cite{Mouchere2012}.
+
+\verb+MfrDB_Symbols_v1.0+ contains recordings for 105~symbols, but for
+11~symbols less than 50~recordings were available. For this reason, the
+optimized classifier was evaluated on 94~of the 105~symbols.
+
+The evaluation results are given in \cref{table:public-eval-results}.
+
+\begin{table}[htb]
+    \centering
+    \begin{tabular}{lcrr}
+    \toprule
+    \multirow{2}{*}{Dataset}  & \multirow{2}{*}{Symbols}  & \multicolumn{2}{c}{Classification error}\\
+    \cmidrule(l){3-4}
+              & & Top-1                 & Top-3                \\\midrule
+    MfrDB       & 94 & $\SI{8.4}{\percent}$  & $\SI{1.3}{\percent}$ \\
+    CROHME 2011 & 56 & $\SI{10.2}{\percent}$ & $\SI{3.7}{\percent}$ \\
+    CROHME 2012 & 75 & $\SI{12.2}{\percent}$ & $\SI{4.1}{\percent}$ \\
+    \bottomrule
+    \end{tabular}
+    \caption{Error rates of the optimized recognizer systems. The output
+             layer of each system was adjusted to the number of symbols it
+             should recognize, and the system was trained with the combined
+             data from write-math and the training data given by the datasets.}
+\label{table:public-eval-results}
+\end{table}

+ 35 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/Makefile

@@ -0,0 +1,35 @@
+SOURCE = errors-by-epoch-pretraining
+DELAY = 80
+DENSITY = 300
+WIDTH = 512
+
+make:
+	pdflatex $(SOURCE).tex -output-format=pdf
+	make clean
+
+clean:
+	rm -rf  $(TARGET) *.class *.html *.log *.aux *.data *.gnuplot
+
+gif:
+	pdfcrop $(SOURCE).pdf
+	convert -verbose -delay $(DELAY) -loop 0 -density $(DENSITY) $(SOURCE)-crop.pdf $(SOURCE).gif
+	make clean
+
+png:
+	make
+	make svg
+	inkscape $(SOURCE).svg -w $(WIDTH) --export-png=$(SOURCE).png
+
+transparentGif:
+	convert $(SOURCE).pdf -transparent white result.gif
+	make clean
+
+svg:
+	make
+	#inkscape $(SOURCE).pdf --export-plain-svg=$(SOURCE).svg
+	pdf2svg $(SOURCE).pdf $(SOURCE).svg
+	# Necessary, as pdf2svg does not always create valid svgs:
+	inkscape $(SOURCE).svg --export-plain-svg=$(SOURCE).svg
+	rsvg-convert -a -w $(WIDTH) -f svg $(SOURCE).svg -o $(SOURCE)2.svg
+	inkscape $(SOURCE)2.svg --export-plain-svg=$(SOURCE).svg
+	rm $(SOURCE)2.svg

The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-1.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2-pretraining.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-3-pretraining.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-4-pretraining.csv


+ 31 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex

@@ -0,0 +1,31 @@
+\begin{tikzpicture}
+    \begin{axis}[
+            axis x line=middle,
+            axis y line=middle,
+            enlarge y limits=true,
+            xmin=0,
+            % xmax=1000,
+            ymin=0.18, ymax=0.4,
+            minor ytick={0, 0.01, ..., 1},
+            % width=15cm, height=8cm,     % size of the image
+            grid = both,
+            minor grid style={dashed, gray!30},
+            major grid style={gray!40},,
+            %grid style={dashed, gray!30},
+            ylabel=error,
+            xlabel=epoch,
+            legend cell align=left,
+            legend style={
+                at={(0.5,-0.1)},
+                anchor=north,
+                legend columns=2
+            }
+         ]
+          \addplot[mark=x,green] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-1.csv};
+          \addplot[mark=x,orange] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-2.csv};
+          \addplot[mark=x,red] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-2-pretraining.csv};
+          \legend{{1 hidden layer},
+                  {2 hidden layers},
+                  {2 hidden layers with pretraining}}
+    \end{axis}
+\end{tikzpicture}

+ 74 - 0
documents/papers/write-math-paper/glossary.tex

@@ -0,0 +1,74 @@
+%!TEX root = thesis.tex
+%Term definitions
+\newacronym{ANN}{ANN}{artificial neural network}
+\newacronym{CSR}{CSR}{cursive script recognition}
+\newacronym{DTW}{DTW}{dynamic time warping}
+\newacronym{GTW}{GTW}{greedy time warping}
+\newacronym{HMM}{HMM}{hidden Markov model}
+\newacronym{HWR}{HWR}{handwriting recognition}
+\newacronym{HWRT}{HWRT}{handwriting recognition toolkit}
+\newacronym{MLP}{MLP}{multilayer perceptron}
+\newacronym{MSE}{MSE}{mean squared error}
+\newacronym{OOV}{OOV}{out of vocabulary}
+\newacronym{TDNN}{TDNN}{time delay neural network}
+\newacronym{PCA}{PCA}{principal component analysis}
+\newacronym{LDA}{LDA}{linear discriminant analysis}
+\newacronym{CROHME}{CROHME}{Competition on Recognition of Online Handwritten Mathematical Expressions}
+\newacronym{GMM}{GMM}{Gaussian mixture model}
+\newacronym{SVM}{SVM}{support vector machine}
+\newacronym{PyPI}{PyPI}{Python Package Index}
+\newacronym{CFM}{CFM}{classification figure of merit}
+\newacronym{CE}{CE}{cross entropy}
+\newacronym{GPU}{GPU}{graphics processing unit}
+\newacronym{CUDA}{CUDA}{Compute Unified Device Architecture}
+\newacronym{SLP}{SLP}{supervised layer-wise pretraining}
+\newacronym{AEP}{AEP}{auto-encoder pretraining}
+
+% Term definitions
+\newglossaryentry{Detexify}{name={Detexify}, description={A system used for
+on-line handwritten symbol recognition which is described in \cite{Kirsch}}}
+
+\newglossaryentry{epoch}{name={epoch}, description={During iterative training of a neural network, an \textit{epoch} is a single pass through the entire training set, followed by testing of the verification set.\cite{Concise12}}}
+
+\newglossaryentry{hypothesis}{
+    name={hypothesis},
+    description={The recognition result which a classifier returns is called a hypothesis. In other words, it is the \enquote{guess} of a classifier},
+    plural=hypotheses
+}
+
+\newglossaryentry{reference}{
+    name={reference},
+    description={Labeled data is used to evaluate classifiers. Those labels are called references},
+}
+
+\newglossaryentry{YAML}{name={YAML}, description={YAML is a human-readable data format that can be used for configuration files}}
+\newglossaryentry{MER}{name={MER}, description={An error measure which combines symbols to equivalence classes. It was introduced on \cpageref{merged-error-introduction}}}
+
+\newglossaryentry{JSON}{name={JSON}, description={JSON, short for JavaScript Object Notation, is a language-independent data format that can be used to transmit data between a server and a client in web applications}}
+
+\newglossaryentry{hyperparamter}{name={hyperparameter}, description={A
+\textit{hyperparameter} is a parameter of a neural net that cannot be learned,
+but has to be chosen}, symbol={\ensuremath{\theta}}}
+
+\newglossaryentry{learning rate}{name={learning rate}, description={A factor $0 \leq \eta \in \mdr$ that affects how fast new weights are learned. $\eta=0$ means that no new data is learned}, symbol={\ensuremath{\eta}}} % Andrew Ng: \alpha
+
+\newglossaryentry{learning rate decay}{name={learning rate decay}, description={The learning rate decay $0 < \alpha \leq 1$ is used to adjust the learning rate. After each epoch the learning rate $\eta$ is updated to $\eta \gets \eta \times \alpha$}, symbol={\ensuremath{\eta}}}
+
+\newglossaryentry{preactivation}{name={preactivation}, description={The preactivation of a neuron is the weighted sum of its input, before the activation function is applied}}
+
+\newglossaryentry{stroke}{name={stroke}, description={The path the pen took from
+the point where the pen was put down to the point where the pen was lifted first}}
+
+\newglossaryentry{line}{name={line}, description={Geometric object that is infinitely long
+and defined by two points.}}
+
+\newglossaryentry{line segment}{name={line segment}, description={Geometric object that has finite length
+and is defined by two points.}}
+
+\newglossaryentry{symbol}{name={symbol}, description={An atomic semantic entity. A more detailed description can be found in \cref{sec:what-is-a-symbol}}}
+
+\newglossaryentry{weight}{name={weight}, description={A
+\textit{weight} is a parameter of a neural net that can be learned}, symbol={\ensuremath{\weight}}}
+
+\newglossaryentry{control point}{name={control point}, description={A
+\textit{control point} is a point recorded by the input device.}}

BIN
documents/papers/write-math-paper/sRGBIEC1966-2.1.icm


+ 12 - 0
documents/papers/write-math-paper/variables.tex

@@ -0,0 +1,12 @@
+\newcommand{\totalCollectedRecordings}{166898}  % ACTUALITY
+\newcommand{\detexifyCollectedRecordings}{153423}
+\newcommand{\trainingsetsize}{134804}
+\newcommand{\validtionsetsize}{15161}
+\newcommand{\testsetsize}{17012}
+\newcommand{\totalClasses}{1111}
+\newcommand{\totalClassesAnalyzed}{369}
+\newcommand{\totalClassesAboveFifty}{680}
+\newcommand{\totalClassesNotAnalyzedBelowFifty}{431}
+\newcommand{\detexifyPercentage}{$\SI{91.93}{\percent}$}
+\newcommand{\recordingsWithDots}{$\SI{2.77}{\percent}$}  % excluding i,j, ...
+\newcommand{\recordingsWithDotsSizechange}{$\SI{0.85}{\percent}$}  % excluding i,j, ...

The diff is not shown because the file is too large.
+ 1750 - 0
documents/papers/write-math-paper/write-math-ba-paper.bib


+ 87 - 0
documents/papers/write-math-paper/write-math-ba-paper.tex

@@ -0,0 +1,87 @@
+\documentclass[9pt,technote,a4paper]{IEEEtran}
+\usepackage{amssymb, amsmath} % needed for math
+
+\usepackage[a-1b]{pdfx}
+\usepackage{filecontents}
+\begin{filecontents*}{\jobname.xmpdata}
+    \Keywords{recognition\sep machine learning\sep neural networks\sep symbols\sep multilayer perceptron}
+    \Title{On-line Recognition of Handwritten Mathematical Symbols}
+    \Author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}
+    \Org{Institute for Anthropomatics and Robotics}
+    \Doi{}
+\end{filecontents*}
+
+\RequirePackage{ifpdf}
+\ifpdf \PassOptionsToPackage{pdfpagelabels}{hyperref} \fi
+\RequirePackage{hyperref}
+\usepackage{parskip}
+\usepackage[pdftex,final]{graphicx}
+\usepackage{csquotes}
+\usepackage{braket}
+\usepackage{booktabs}
+\usepackage{multirow}
+\usepackage{pgfplots}
+\usepackage{wasysym}
+\usepackage{caption}
+% \captionsetup{belowskip=12pt,aboveskip=4pt}
+\makeatletter
+\newcommand\mynobreakpar{\par\nobreak\@afterheading}
+\makeatother
+\usepackage[noadjust]{cite}
+\usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
+\usepackage[binary-units,group-separator={,}]{siunitx}
+\sisetup{per-mode=fraction,binary-units=true}
+\DeclareSIUnit\pixel{px}
+\usepackage{glossaries}
+\loadglsentries[main]{glossary}
+\makeglossaries
+
+\title{On-line Recognition of Handwritten Mathematical Symbols}
+\author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}
+
+\hypersetup{
+  pdfauthor   = {Martin Thoma\sep Kevin Kilgour\sep Sebastian St{\"u}ker\sep Alexander Waibel},
+  pdfkeywords = {recognition\sep machine learning\sep neural networks\sep symbols\sep multilayer perceptron},
+  pdfsubject  = {Recognition},
+  pdftitle    = {On-line Recognition of Handwritten Mathematical Symbols},
+}
+\include{variables}
+\crefname{table}{Table}{Tables}
+\crefname{figure}{Figure}{Figures}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Begin document                                                    %
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{document}
+\maketitle
+\begin{abstract}
+The automatic recognition of single handwritten symbols has three main
+applications: supporting users who know what a symbol looks like, but not what
+its name is, providing the necessary commands for professional publishing, or
+as a building block for formula recognition.
+
+This paper presents a system which uses the pen trajectory to classify
+handwritten symbols. Five preprocessing steps, one data multiplication
+algorithm, five features and five variants for multilayer Perceptron training
+were evaluated using $\num{166898}$ recordings. Those recordings were made
+publicly available. The evaluation results of these 21~experiments were used to
+create an optimized recognizer which has a top-1 error of less than
+$\SI{17.5}{\percent}$ and a top-3 error of $\SI{4.0}{\percent}$. This is a
+relative improvement of $\SI{18.5}{\percent}$ for the top-1 error and
+$\SI{29.7}{\percent}$ for the top-3 error compared to the baseline system. This
+improvement was achieved by \acrlong{SLP} and adding new features. The
+improved classifier can be used via \href{http://write-math.com/}{write-math.com}.
+\end{abstract}
+
+\input{ch1-introduction}
+\input{ch2-general-system-design}
+\input{ch3-data-and-implementation}
+\input{ch4-algorithms}
+\input{ch5-optimization-of-system-design}
+\input{ch6-summary}
+\input{ch7-mfrdb-eval}
+
+
+\bibliographystyle{IEEEtranSA}
+\bibliography{write-math-ba-paper}
+\end{document}