
Add papers/write-math-paper

Martin Thoma 9 years ago
parent commit fe78311901
25 changed files with 10,624 additions and 0 deletions
  1. +15 -0     documents/papers/write-math-paper/Makefile
  2. +11 -0     documents/papers/write-math-paper/README.md
  3. +17 -0     documents/papers/write-math-paper/abstract-500-chars.txt
  4. +1001 -0   documents/papers/write-math-paper/baseline-1.csv
  5. +1001 -0   documents/papers/write-math-paper/baseline-2-pretraining.csv
  6. +1001 -0   documents/papers/write-math-paper/baseline-2.csv
  7. +46 -0     documents/papers/write-math-paper/ch1-introduction.tex
  8. +36 -0     documents/papers/write-math-paper/ch2-general-system-design.tex
  9. +20 -0     documents/papers/write-math-paper/ch3-data-and-implementation.tex
 10. +113 -0    documents/papers/write-math-paper/ch4-algorithms.tex
 11. +214 -0    documents/papers/write-math-paper/ch5-optimization-of-system-design.tex
 12. +123 -0    documents/papers/write-math-paper/ch6-summary.tex
 13. +32 -0     documents/papers/write-math-paper/ch7-mfrdb-eval.tex
 14. +35 -0     documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/Makefile
 15. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-1.csv
 16. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2-pretraining.csv
 17. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2.csv
 18. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-3-pretraining.csv
 19. +1001 -0   documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-4-pretraining.csv
 20. +31 -0     documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex
 21. +74 -0     documents/papers/write-math-paper/glossary.tex
 22. BIN        documents/papers/write-math-paper/sRGBIEC1966-2.1.icm
 23. +12 -0     documents/papers/write-math-paper/variables.tex
 24. +1750 -0   documents/papers/write-math-paper/write-math-ba-paper.bib
 25. +87 -0     documents/papers/write-math-paper/write-math-ba-paper.tex

+ 15 - 0
documents/papers/write-math-paper/Makefile

@@ -0,0 +1,15 @@
+DOKUMENT = write-math-ba-paper
+make:
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # aux-files for makeindex / makeglossaries
+	makeglossaries $(DOKUMENT)
+	bibtex $(DOKUMENT)
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include index
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include symbol table
+	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include symbol table
+	make clean
+
+combine:
+	pdftk Dok1-open-kit.pdf KIT_SWP_Vorlage_Impressum_en_2015.pdf write-math-ba-paper.pdf KIT-WSP_RS_en.pdf cat output single-symbol-classification-paper.pdf
+
+clean:
+	rm -rf  $(TARGET) *.class *.html *.log *.aux *.out *.thm *.idx *.toc *.ind *.ilg figures/torus.tex *.glg *.glo *.gls *.ist *.xdy *.fdb_latexmk *.bak *.blg *.bbl *.glsdefs *.acn *.acr *.alg *.nls *.nlo *.bak *.pyg *.lot *.lof

+ 11 - 0
documents/papers/write-math-paper/README.md

@@ -0,0 +1,11 @@
+[Download compiled PDF](https://github.com/MartinThoma/write-math-paper/blob/master/write-math-ba-paper.pdf?raw=true)
+
+## License
+
+This work is licensed under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/).
+
+## Spell checking
+* Spell checking `for f in ch*.tex; do aspell --lang=en --mode=tex check $f; done`
+* Spell checking `for f in ch*.tex; do /home/moose/GitHub/Academic-Writing-Check/checkwriting $f; done`
+* Spell checking with `http://www.reverso.net/spell-checker`
+* https://github.com/devd/Academic-Writing-Check

+ 17 - 0
documents/papers/write-math-paper/abstract-500-chars.txt

@@ -0,0 +1,17 @@
+Autoren: Thoma, Martin; Kilgour, Kevin; Stüker, Sebastian; Waibel, Alexander
+Titel: On-line Recognition of Handwritten Mathematical Symbols
+Institut: Institute for Anthropomatics and Robotics
+
+Abstract (max 500 Zeichen):
+This paper presents a classification system which uses the pen trajectory to
+classify handwritten symbols. Five preprocessing steps, one data multiplication
+algorithm, five features and five variants for multilayer Perceptron training
+were evaluated using $\num{166898}$ recordings. The evaluation results of
+21~experiments were used to create an optimized recognizer. This improvement
+was achieved by \acrlong{SLP} and adding new features.
+
+Keywords (max 5): recognition; machine learning; neural networks; symbols;
+multilayer perceptron
+
+Geplanter Veröffentlichungstermin: 1. August 2015
+

The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/baseline-1.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/baseline-2-pretraining.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/baseline-2.csv


+ 46 - 0
documents/papers/write-math-paper/ch1-introduction.tex

@@ -0,0 +1,46 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Introduction}
+On-line recognition makes use of the pen trajectory. One possible
+representation of the data is given as groups of sequences of tuples $(x, y, t)
+\in \mathbb{R}^3$, where each group represents a stroke, $(x, y)$ is the
+position of the pen on a canvas and $t$ is the time.
+
+% On-line data was used to classify handwritten natural language text in many
+% different variants. For example, the $\text{NPen}^{++}$ system classified
+% cursive handwriting into English words by using hidden Markov models and neural
+% networks~\cite{Manke1995}.
+
+% Several systems for mathematical symbol recognition with on-line data have been
+% described so far~\cite{Kosmala98,Mouchere2013}, but no standard test set
+% existed to compare the results of different classifiers for single-symbol
+% classification of mathematical symbols. The used symbols differed in most
+% papers. This is unfortunate as the choice of symbols is crucial for the top-$n$
+% error. For example, the symbols $o$, $O$, $\circ$ and $0$ are very similar and
+% systems which know all those classes will certainly have a higher top-$n$ error
+% than systems which only accept one of them. But not only the classes differed,
+% also the used data to train and test had to be collected by each author again.
+
+\cite{Kirsch}~describes a system called Detexify which uses
+time warping to classify on-line handwritten symbols and reports a top-3 error
+of less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols. He also
+recently published his data on \url{https://github.com/kirel/detexify-data},
+which was collected by a crowdsourcing approach via
+\url{http://detexify.kirelabs.org}. Those recordings as well as some recordings
+which were collected by a similar approach via \url{http://write-math.com} were
+merged into a single data set, the labels were semi-automatically checked for
+correctness and used to train and evaluate different classifiers. A more
+detailed description of all used software, data and experiments is given
+in~\cite{Thoma:2014}.
+
+In this paper we present a baseline system for the classification of on-line
+handwriting into $369$ classes of which some are very similar. An optimized
+classifier was developed which has a $\SI{29.7}{\percent}$ relative improvement
+of the top-3 error. This was achieved by using better features and \gls{SLP}.
+The absolute improvements compared to the baseline of those changes will also
+be shown.
+
+In the following, we will give a general overview of the system design, give
+information about the used data and implementation, describe the algorithms
+we used to classify the data, report results of our experiments and present
+the optimized recognizer we created.
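The tuple representation described in ch1-introduction.tex above maps naturally onto nested lists. The following sketch only illustrates that structure; the literal values and the helper function are made up for illustration and are not part of the hwrt package.

    # A recording is a group of strokes; each stroke is a sequence of
    # (x, y, t) tuples: pen position on the canvas plus a timestamp.
    recording = [
        [   # first stroke
            (0.12, 0.30, 0.000),
            (0.15, 0.34, 0.012),
            (0.19, 0.40, 0.025),
        ],
        [   # second stroke
            (0.50, 0.10, 0.480),
            (0.52, 0.18, 0.495),
        ],
    ]

    def bounding_box(recording):
        """Return (min_x, min_y, max_x, max_y) over all strokes."""
        xs = [x for stroke in recording for (x, y, t) in stroke]
        ys = [y for stroke in recording for (x, y, t) in stroke]
        return min(xs), min(ys), max(xs), max(ys)

    print(bounding_box(recording))  # used e.g. by scale-and-shift later on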

+ 36 - 0
documents/papers/write-math-paper/ch2-general-system-design.tex

@@ -0,0 +1,36 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{General System Design}
+The following steps are used for symbol classification:\nobreak
+\begin{enumerate}
+    \item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
+          errors and people make mistakes while using the devices. To tackle
+          these problems there are preprocessing algorithms to clean the data.
+          The preprocessing algorithms can also remove unnecessary variations
+          of the data that do not help in the classification process, but hide
+          what is important. Having slightly different sizes of the same symbol
+          is an example of such a variation. Four preprocessing algorithms that
+          clean or normalize recordings are explained in
+          \cref{sec:preprocessing}.
+    \item \textbf{Data multiplication}: Learning systems need lots of data
+          to learn internal parameters. If there is not enough data available,
+          domain knowledge can be considered to create new artificial data from
+          the original data. In the domain of on-line handwriting recognition,
+          data can be multiplied by adding rotated variants.
+    \item \textbf{Feature extraction}: A feature is high-level information
+          derived from the raw data after preprocessing. Some systems like
+          Detexify take the result of the preprocessing step, but many compute
+          new features. Those features can be designed by a human engineer or
+          learned. Non-raw data features have the advantage that less
+          training data is needed since the developer uses knowledge about
+          handwriting to compute highly discriminative features. Various
+          features are explained in \cref{sec:features}.
+\end{enumerate}
+
+After these steps, it is a classification task for which the classifier has to
+learn internal parameters before it can classify new recordings. We classified
+recordings by computing constant-sized feature vectors and using
+\glspl{MLP}. There are many ways to adjust \glspl{MLP} (number of neurons and
+layers, activation functions) and their training (learning rate, momentum,
+error function). Some of them are described in~\cref{sec:mlp-training} and the
+evaluation results are presented in \cref{ch:Optimization-of-System-Design}.
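The preparation steps and the final classification step from ch2-general-system-design.tex can be summarized as a small pipeline. This is only a structural sketch under the assumption that each step is a plain function; all names below are placeholders, not the hwrt API.

    # Structural sketch of the recognition pipeline described above.
    # Every name here is a placeholder used for illustration only.

    def classify(recording, preprocessors, feature_extractors, mlp):
        # 1. Preprocessing: clean and normalize the raw pen trajectory.
        for preprocess in preprocessors:
            recording = preprocess(recording)
        # 2. Feature extraction: build a constant-sized feature vector.
        features = []
        for extract in feature_extractors:
            features.extend(extract(recording))
        # 3. Classification: the MLP yields a score per symbol class.
        return mlp.predict([features])

    # Data multiplication (e.g. adding rotated copies of each recording)
    # is a training-time step and therefore does not appear in this
    # inference sketch.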

+ 20 - 0
documents/papers/write-math-paper/ch3-data-and-implementation.tex

@@ -0,0 +1,20 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Data and Implementation}
+We used $\num{369}$ symbol classes with a total of $\num{166898}$ labeled
+recordings. Each class has at least $\num{50}$ labeled recordings, but over
+$200$ symbols have more than $\num{200}$ labeled recordings and over $100$
+symbols have more than $500$ labeled recordings.
+The data was collected by two crowd-sourcing projects (Detexify and
+\href{http://write-math.com}{write-math.com}) where users wrote
+symbols, were then given a list ordered by an early classification system and
+clicked on the symbol they wrote.
+
+The data of Detexify and \href{http://write-math.com}{write-math.com} was
+combined, filtered semi-automatically and can be downloaded via
+\href{http://write-math.com/data}{write-math.com/data} as a compressed tar
+archive of CSV files.
+
+All of the following preprocessing and feature computation algorithms were
+implemented and are publicly available as open-source software in the Python
+package \texttt{hwrt}.
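Since the merged data set is distributed as a compressed tar archive of CSV files, it can be walked with nothing but the Python standard library. The archive file name and the column names in this sketch are assumptions made for illustration; the real download may use different names.

    import csv
    import io
    import tarfile

    ARCHIVE = "write-math-data.tar.bz2"  # assumed file name

    with tarfile.open(ARCHIVE, "r:*") as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".csv"):
                continue
            f = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
            for row in csv.DictReader(f):
                # Hypothetical column names -- check the real CSV header.
                label = row.get("symbol")
                strokes = row.get("data")
                # ... hand the recording to the preprocessing pipeline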

+ 113 - 0
documents/papers/write-math-paper/ch4-algorithms.tex

@@ -0,0 +1,113 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Algorithms}
+\subsection{Preprocessing}\label{sec:preprocessing}
+Preprocessing in symbol recognition is done to improve the quality and
+expressive power of the data. It makes follow-up tasks like feature extraction
+and classification easier, more effective or faster. It does so by resolving
+errors in the input data, reducing duplicate information and removing
+irrelevant information.
+
+Preprocessing algorithms fall into two groups: Normalization and noise
+reduction algorithms.
+
+A very important normalization algorithm in single-symbol recognition is
+\textit{scale-and-shift}~\cite{Thoma:2014}. It scales the recording so that
+its bounding box fits into a unit square. As the aspect ratio of a recording is
+almost never 1:1, only one dimension will fit exactly in the unit square. For
+this paper, it was chosen to shift the recording in the direction of its bigger
+dimension into the $[0,1] \times [0,1]$ unit square. After that, the recording
+is shifted in the direction of its smaller dimension such that its bounding box is
+centered around zero.
+
+Another normalization preprocessing algorithm is
+resampling~\cite{Guyon91,Manke01}. As the data points on the pen trajectory are
+generated asynchronously and with different time-resolutions depending on the
+used hardware and software, it is desirable to resample the recordings to have
+points spread equally in time for every recording. This was done by linear
+interpolation of the $(x,t)$ and $(y,t)$ sequences and getting a fixed number
+of equally spaced points per stroke.
+
+\textit{Stroke connection} is a noise reduction algorithm which is mentioned
+in~\cite{Tappert90}. It happens sometimes that the hardware detects that the
+user lifted the pen where the user certainly didn't do so. This can be detected
+by measuring the Euclidean distance between the end of one stroke and the
+beginning of the next stroke. If this distance is below a threshold, then the
+strokes are connected.
+
+Due to a limited resolution of the recording device and due to erratic
+handwriting, the pen trajectory might not be smooth. One way to smooth is
+calculating a weighted average and replacing points by the weighted average of
+their coordinates and their neighbors' coordinates. Another way to do smoothing
+is to reduce the number of points with the Douglas-Peucker
+algorithm to the points that are more relevant for the
+overall shape of a stroke and then interpolate the stroke between those points.
+The Douglas-Peucker stroke simplification algorithm is usually used in
+cartography to simplify the shape of roads. It works recursively to find a
+subset of points of a stroke that is simpler and still similar to the original
+shape. The algorithm adds the first and the last point $p_1$ and $p_n$ of a
+stroke to the simplified set of points $S$. Then it searches the point $p_i$ in
+between that has maximum distance from the line $p_1 p_n$. If this distance is
+above a threshold $\varepsilon$, the point $p_i$ is added to $S$. Then the
+algorithm gets applied to $p_1 p_i$ and $p_i p_n$ recursively. It is described
+as \enquote{Algorithm 1} in~\cite{Visvalingam1990}.
+
+\subsection{Features}\label{sec:features}
+Features can be \textit{global}, that means calculated for the complete
+recording or complete strokes. Other features are calculated for single points
+on the pen trajectory and are called \textit{local}.
+
+Global features are the \textit{number of strokes} in a recording, the
+\textit{aspect ratio} of a recording's bounding box or the
+\textit{ink} being used for a recording. The ink feature gets calculated by
+measuring the length of all strokes combined. The re-curvature, which was
+introduced in~\cite{Huang06}, is defined as
+\[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
+and is a stroke-global feature.
+
+The simplest local feature is the coordinate of the point itself. Speed,
+curvature and a local small-resolution bitmap around the point, which was
+introduced by Manke, Finke and Waibel in~\cite{Manke1995}, are other local
+features.
+
+\subsection{Multilayer Perceptrons}\label{sec:mlp-training}
+\Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
+different numbers of hidden layers; the number of neurons per layer and the
+activation functions can be varied. The learning algorithm is parameterized by
+the learning rate $\eta \in (0, \infty)$, the momentum $\alpha \in [0, \infty)$
+and the number of epochs.
+
+The topology of \glspl{MLP} will be denoted in the following by separating the
+number of neurons per layer with colons. For example, the notation
+$160{:}500{:}500{:}500{:}369$ means that the input layer gets 160~features,
+there are three hidden layers with 500~neurons per layer and one output layer
+with 369~neurons.
+
+\glspl{MLP} training can be executed in various different ways, for example
+with \acrfull{SLP}. In case of a \gls{MLP} with the topology
+$160{:}500{:}500{:}500{:}369$, \gls{SLP} works as follows: At first a \gls{MLP}
+with one hidden layer ($160{:}500{:}369$) is trained. Then the output layer is
+discarded, a new hidden layer and a new output layer are added and it is trained
+again, resulting in a $160{:}500{:}500{:}369$ \gls{MLP}. The output layer is
+discarded again, a new hidden layer and a new output layer are added,
+and the training is executed again.
+
+Denoising auto-encoders are another way of pretraining. An
+\textit{auto-encoder} is a neural network that is trained to restore its input.
+This means the number of input neurons is equal to the number of output
+neurons. The weights define an \textit{encoding} of the input that allows
+restoring the input. As the neural network finds the encoding by itself, it is
+called auto-encoder. If the hidden layer is smaller than the input layer, it
+can be used for dimensionality reduction~\cite{Hinton1989}. If only one hidden
+layer with linear activation functions is used, then the hidden layer contains
+the principal components after training~\cite{Duda2001}.
+
+Denoising auto-encoders are a variant introduced in~\cite{Vincent2008} that
+is more robust to partial corruption of the input features. It is trained to
+get robust by adding noise to the input features.
+
+There are multiple ways in which noise can be added. Gaussian noise and randomly
+masking elements with zero are two possibilities.
+\cite{Deeplearning-Denoising-AE} describes how such a denoising auto-encoder
+with masking noise can be implemented. The corruption $\varkappa \in [0, 1)$ is
+the probability of a feature being masked.
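Two of the preprocessing steps described above lend themselves to compact reference implementations: stroke connection (join consecutive strokes whose gap is below a threshold) and Douglas-Peucker simplification. The sketch below is an illustrative re-implementation of the stated definitions, not the hwrt code; points are assumed to be (x, y) pairs.

    import math

    def connect_strokes(strokes, threshold=10.0):
        """Join consecutive strokes whose endpoint gap is below threshold."""
        connected = [list(strokes[0])]
        for stroke in strokes[1:]:
            last_x, last_y = connected[-1][-1]
            first_x, first_y = stroke[0]
            if math.hypot(first_x - last_x, first_y - last_y) < threshold:
                connected[-1].extend(stroke)   # merge into the previous stroke
            else:
                connected.append(list(stroke))
        return connected

    def point_line_distance(p, a, b):
        """Distance of point p from the line through a and b."""
        (px, py), (ax, ay), (bx, by) = p, a, b
        if (ax, ay) == (bx, by):
            return math.hypot(px - ax, py - ay)
        cross = (bx - ax) * (ay - py) - (by - ay) * (ax - px)
        return abs(cross) / math.hypot(bx - ax, by - ay)

    def douglas_peucker(points, epsilon):
        """Recursive simplification: keep only points that deviate more
        than epsilon from the line between the kept endpoints."""
        if len(points) < 3:
            return list(points)
        index, dmax = 0, 0.0
        for i in range(1, len(points) - 1):
            d = point_line_distance(points[i], points[0], points[-1])
            if d > dmax:
                index, dmax = i, d
        if dmax <= epsilon:
            return [points[0], points[-1]]
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right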

+ 214 - 0
documents/papers/write-math-paper/ch5-optimization-of-system-design.tex

@@ -0,0 +1,214 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Optimization of System Design}\label{ch:Optimization-of-System-Design}
+In order to evaluate the effect of different preprocessing algorithms, features
+and adjustments in the \gls{MLP} training and topology, the following baseline
+system was used:
+
+Scale the recording to fit into a unit square while keeping the aspect ratio,
+shift it as described in \cref{sec:preprocessing},
+resample it with linear interpolation to get 20~points per stroke, spaced
+evenly in time. Take the first 4~strokes with 20~points per stroke and
+2~coordinates per point as features, resulting in 160~features which is equal
+to the number of input neurons. If a recording has fewer than 4~strokes, the
+remaining features are filled with zeroes.
+
+All experiments were evaluated with four baseline systems $B_{hl=i}$, $i \in \Set{1,
+2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
+could have a severe influence on the effect of new features or preprocessing
+steps. Each hidden layer in all evaluated systems has $500$ neurons.
+
+Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
+of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
+the sigmoid function. The neurons in the
+output layer use the softmax function. For every experiment, exactly one part
+of the baseline systems was changed.
+
+
+\subsection{Random Weight Initialization}
+The neural networks in all experiments got initialized with a small random
+weight
+
+\[w_{i,j} \sim U(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}, 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}})\]
+
+where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
+of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
+initialization was suggested in
+\cite{deeplearningweights} and is done to break symmetry.
+
+This can lead to different error rates for the same systems just because the
+initialization was different.
+
+In order to get an impression of the magnitude of the influence on the different
+topologies and error rates, the baseline models were trained 5 times with
+random initializations.
+\Cref{table:baseline-systems-random-initializations-summary}
+shows a summary of the results. The more hidden layers are used, the more
+the results vary between different random weight initializations.
+
+\begin{table}[h]
+    \centering
+    \begin{tabular}{crrr|rrr} %chktex 44
+    \toprule
+    \multirow{3}{*}{System}  & \multicolumn{6}{c}{Classification error}\\
+    \cmidrule(l){2-7}
+               & \multicolumn{3}{c}{Top-1}   & \multicolumn{3}{c}{Top-3}\\
+               & Min                   & Max                   & Mean                  & Min                  & Max                  & Mean\\\midrule
+    $B_{hl=1}$ & $\SI{23.1}{\percent}$ & $\SI{23.4}{\percent}$ & $\SI{23.2}{\percent}$ & $\SI{6.7}{\percent}$ & $\SI{6.8}{\percent}$ & $\SI{6.7}{\percent}$ \\
+    $B_{hl=2}$ & \underline{$\SI{21.4}{\percent}$} & \underline{$\SI{21.8}{\percent}$}& \underline{$\SI{21.6}{\percent}$} & $\SI{5.7}{\percent}$ & \underline{$\SI{5.8}{\percent}$} & \underline{$\SI{5.7}{\percent}$}\\
+    $B_{hl=3}$ & $\SI{21.5}{\percent}$ & $\SI{22.3}{\percent}$ & $\SI{21.9}{\percent}$ & \underline{$\SI{5.5}{\percent}$} & $\SI{5.8}{\percent}$ & \underline{$\SI{5.7}{\percent}$}\\
+    $B_{hl=4}$ & $\SI{23.2}{\percent}$ & $\SI{24.8}{\percent}$ & $\SI{23.9}{\percent}$ & $\SI{6.0}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{6.2}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{The systems $B_{hl=1}$ -- $B_{hl=4}$ were randomly initialized,
+             trained and evaluated 5~times to estimate the influence of random
+             weight initialization.}
+\label{table:baseline-systems-random-initializations-summary}
+\end{table}
+
+
+\subsection{Stroke connection}
+In order to solve the problem of interrupted strokes, pairs of strokes
+can be connected with a stroke connection algorithm. The idea is that for
+a pair of consecutively drawn strokes $s_{i}, s_{i+1}$ the last point of $s_i$ is
+close to the first point of $s_{i+1}$ if a stroke was accidentally split
+into two strokes.
+
+$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
+between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connection
+algorithm was evaluated with $\SI{5}{\pixel}$, $\SI{10}{\pixel}$ and
+$\SI{20}{\pixel}$.
+All models' top-3 error improved with a threshold of $\theta = \SI{10}{\pixel}$
+by at least $\num{0.2}$ percentage points, except $B_{hl=4}$ which did not notably
+improve.
+
+
+\subsection{Douglas-Peucker Smoothing}
+The Douglas-Peucker algorithm was applied with a threshold of $\varepsilon =
+0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
+but before resampling. The interpolation in the resampling step was done
+linearly and with cubic splines in two experiments. The recording was scaled
+and shifted again after the interpolation because the bounding box might have
+changed.
+
+The result of the application of the Douglas-Peucker smoothing with $\varepsilon
+> 0.05$ was a high rise of the top-1 and top-3 error for all models $B_{hl=i}$.
+This means that the simplification process removes some relevant information and
+does not---as it was expected---remove only noise. For $\varepsilon = 0.05$
+with linear interpolation some models' top-1 error improved, but the
+changes were small. It could be an effect of random weight initialization.
+However, cubic spline interpolation made all systems perform more than
+$\num{1.7}$ percentage points worse for top-1 and top-3 error.
+
+The lower the value of $\varepsilon$, the less does the recording change after
+this preprocessing step. As it was applied after scaling the recording such that
+the biggest dimension of the recording (width or height) is $1$, a value of
+$\varepsilon = 0.05$ means that a point has to be at least $\SI{5}{\percent}$
+of the biggest dimension away from the approximating line segment to be kept.
+
+
+\subsection{Global Features}
+Single global features were added one at a time to the baseline systems. Those
+features were re-curvature
+$\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
+as described in \cite{Huang06}, the ink feature which is the summed length
+of all strokes, the stroke count, the aspect ratio and the stroke center points
+for the first four strokes. The stroke center point feature improved the system
+$B_{hl=1}$ by $\num{0.3}$~percentage points for the top-3 error and system $B_{hl=3}$ for
+the top-1 error by $\num{0.7}$~percentage points, but all other systems and
+error measures either got worse or did not improve much.
+
+The other global features did improve the systems $B_{hl=1}$ -- $B_{hl=3}$, but not
+$B_{hl=4}$. The highest improvement was achieved with the re-curvature feature. It
+improved the systems $B_{hl=1}$ -- $B_{hl=4}$ by more than $\num{0.6}$~percentage points
+top-1 error.
+
+
+\subsection{Data Multiplication}
+Data multiplication can be used to make the model invariant to transformations.
+However, this idea seems not to work well in the domain of on-line handwritten
+mathematical symbols. We tripled the data by adding a version that is rotated
+3~degrees to the left and another one that is rotated 3~degrees to the right
+around the center of mass. This data multiplication made all classifiers for
+most error measures perform worse by more than $\num{2}$~percentage points for
+the top-1 error.
+
+The same experiment was executed by rotating by 6~degrees and in another
+experiment by 9~degrees, but those performed even worse.
+
+Also multiplying the data by a factor of 5 by adding two 3-degree rotated
+variants and two 6-degree rotated variants made the classifier perform worse
+by more than $\num{2}$~percentage points.
+
+
+\subsection{Pretraining}\label{subsec:pretraining-evaluation}
+Pretraining is a technique used to improve the training of \glspl{MLP} with
+multiple hidden layers.
+
+\Cref{table:pretraining-slp} shows that \gls{SLP} improves the classification
+performance by $\num{1.6}$ percentage points for the top-1 error and
+$\num{1.0}$ percentage points for the top-3 error. As one can see in
+\cref{fig:training-and-test-error-for-different-topologies-pretraining}, this
+is not only the case because of the longer training as the test error is
+relatively stable after $\num{1000}$ epochs of training. This was confirmed
+by an experiment in which the baseline systems were trained for $\num{10000}$
+epochs and did not perform notably different.
+
+\begin{figure}[htb]
+    \centering
+    \input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
+    \caption{Training- and test error by number of trained epochs for different
+             topologies with \acrfull{SLP}. The plot shows
+             that all pretrained systems performed much better than the systems
+             without pretraining. None of the plotted systems improved
+             with more epochs of training.}
+\label{fig:training-and-test-error-for-different-topologies-pretraining}
+\end{figure}
+
+\begin{table}[tb]
+    \centering
+    \begin{tabular}{lrrrr}
+    \toprule
+    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
+    \cmidrule(l){2-5}
+                & Top-1                  & Change               & Top-3                & Change                 \\\midrule
+    $B_{hl=1}$     & $\SI{23.2}{\percent}$  & -                    & $\SI{6.7}{\percent}$ & - \\
+    $B_{hl=2,SLP}$ & $\SI{19.9}{\percent}$ & $\SI{-1.7}{\percent}$ & $\SI{4.7}{\percent}$ & $\SI{-1.0}{\percent}$\\
+    $B_{hl=3,SLP}$ & \underline{$\SI{19.4}{\percent}$} & $\SI{-2.5}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.1}{\percent}$\\
+    $B_{hl=4,SLP}$ & $\SI{19.6}{\percent}$ & $\SI{-4.3}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.6}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{Systems with 1--4 hidden layers which used \acrfull{SLP}
+             compared to the mean of systems $B_{hl=1}$--$B_{hl=4}$ displayed
+             in \cref{table:baseline-systems-random-initializations-summary}
+             which used pure gradient descent. The \gls{SLP}
+             systems clearly performed better.}
+\label{table:pretraining-slp}
+\end{table}
+
+
+Pretraining with denoising auto-encoders led to the much worse results listed in
+\cref{table:pretraining-denoising-auto-encoder}. The first layer used a $\tanh$
+activation function. Every layer was trained for $1000$ epochs with the
+\gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
+$\varkappa = 0.3$ and a $L_2$ regularization of $\lambda = 10^{-4}$ were
+chosen. This pretraining setup made all systems with all error measures perform
+much worse.
+
+\begin{table}[tb]
+    \centering
+    \begin{tabular}{lrrrr}
+    \toprule
+    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
+    \cmidrule(l){2-5}
+                 & Top-1                  & Change               & Top-3                & Change                 \\\midrule
+    $B_{hl=1,AEP}$ & $\SI{23.8}{\percent}$ & $\SI{+0.6}{\percent}$ & $\SI{7.2}{\percent}$ & $\SI{+0.5}{\percent}$\\
+    $B_{hl=2,AEP}$ & \underline{$\SI{22.8}{\percent}$} & $\SI{+1.2}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{+0.7}{\percent}$\\
+    $B_{hl=3,AEP}$ & $\SI{23.1}{\percent}$ & $\SI{+1.2}{\percent}$ & \underline{$\SI{6.1}{\percent}$} & $\SI{+0.4}{\percent}$\\
+    $B_{hl=4,AEP}$ & $\SI{25.6}{\percent}$ & $\SI{+1.7}{\percent}$ & $\SI{7.0}{\percent}$ & $\SI{+0.8}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{Systems with denoising \acrfull{AEP} compared to pure
+             gradient descent. The \gls{AEP} systems performed worse.}
+\label{table:pretraining-denoising-auto-encoder}
+\end{table}
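The supervised layer-wise pretraining schedule evaluated above (train a 160:500:369 net, discard the output layer, add a new hidden layer and a new output layer, retrain) can be illustrated with a generic neural-network library. The paper's own implementation belongs to the hwrt tooling; the Keras-based sketch below is purely an illustrative assumption, using the sigmoid/softmax activations, learning rate 0.1 and momentum 0.1 from the baseline description, and random placeholder data instead of the real recordings.

    import numpy as np
    from tensorflow import keras

    # Placeholder data with the paper's dimensions: 160 input features,
    # 369 symbol classes. Real training uses the write-math recordings.
    x_train = np.random.rand(1000, 160).astype("float32")
    y_train = keras.utils.to_categorical(
        np.random.randint(0, 369, size=1000), num_classes=369)

    def compile_and_fit(model, epochs=1000):
        # 1000 epochs per layer as in the paper; lower this for a quick test.
        model.compile(
            optimizer=keras.optimizers.SGD(learning_rate=0.1, momentum=0.1),
            loss="categorical_crossentropy")
        model.fit(x_train, y_train, epochs=epochs, batch_size=128, verbose=0)

    # Step 1: train a net with one hidden layer (160:500:369).
    model = keras.Sequential([
        keras.Input(shape=(160,)),
        keras.layers.Dense(500, activation="sigmoid"),
        keras.layers.Dense(369, activation="softmax"),
    ])
    compile_and_fit(model)

    # Steps 2 and 3: discard the output layer, add a new hidden layer and a
    # new output layer, and train again -- ending at 160:500:500:500:369.
    for _ in range(2):
        model.pop()                                   # drop the softmax layer
        model.add(keras.layers.Dense(500, activation="sigmoid"))
        model.add(keras.layers.Dense(369, activation="softmax"))
        compile_and_fit(model)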

+ 123 - 0
documents/papers/write-math-paper/ch6-summary.tex

@@ -0,0 +1,123 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Summary}
+Four baseline recognition systems were adjusted in many experiments and their
+recognition capabilities were compared in order to build a recognition system
+that can recognize 369 mathematical symbols with low error rates as well as to
+evaluate which preprocessing steps and features help to improve the recognition
+rate.
+
+All recognition systems were trained and evaluated with
+$\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
+symbols. These recordings were collected by two crowdsourcing projects
+(\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
+\href{http://write-math.com}{write-math.com}) and created with various devices. While
+some recordings were created with standard touch devices such as tablets and
+smartphones, others were created with the mouse.
+
+\Glspl{MLP} were used for the classification task. Four baseline systems with
+different numbers of hidden layers were used, as the number of hidden layers
+influences the capabilities and problems of \glspl{MLP}.
+
+All baseline systems used the same preprocessing queue. The recordings were
+scaled and shifted as described in \ref{sec:preprocessing}, resampled with
+linear interpolation so that every stroke had exactly 20~points which are
+spread equidistant in time. The 80~($x,y$) coordinates of the first 4~strokes
+were used to get exactly $160$ input features for every recording. The baseline
+system $B_{hl=2}$ has a top-3 error of $\SI{5.7}{\percent}$.
+
+Adding two slightly rotated variants for each recording and hence tripling the
+training set made the systems $B_{hl=3}$ and $B_{hl=4}$ perform much worse, but
+improved the performance of the smaller systems.
+
+The global features re-curvature, ink, stroke count and aspect ratio improved
+the systems $B_{hl=1}$--$B_{hl=3}$, whereas the stroke center point feature
+made $B_{hl=2}$ perform worse.
+
+Denoising auto-encoders were evaluated as one way to use pretraining, but
+this increased the error rate notably. However, \acrlong{SLP} improved the
+performance decidedly.
+
+The stroke connection algorithm was added to the preprocessing steps of the
+baseline system as well as the re-curvature feature, the ink feature, the
+number of strokes and the aspect ratio. The training setup of the baseline
+system was changed to \acrlong{SLP} and the resulting model was trained with a
+lower learning rate again. This optimized recognizer $B_{hl=2,c}'$ had a top-3
+error of $\SI{4.0}{\percent}$. This means that the top-3 error dropped by over
+$\num{1.7}$ percentage points in comparison to the baseline system $B_{hl=2}$.
+
+A top-3 error of $\SI{4.0}{\percent}$ makes the system usable for symbol
+lookup. It could also be used as a starting point for the development of a
+multiple-symbol classifier.
+
+The aim of this work was to develop a symbol recognition system which is easy
+to use, fast and has high recognition rates as well as evaluating ideas for
+single symbol classifiers. Some of those goals were reached. The recognition
+system $B_{hl=2,c}'$ evaluates new recordings in a fraction of a second and has
+acceptable recognition rates.
+
+% Many algorithms were evaluated. However, there are still many other
+% algorithms which could be evaluated and, at the time of this work, the best
+% classifier $B_{hl=2,c}'$ is only available through the Python package
+% \texttt{hwrt}. It is planned to add an web version of that classifier online.
+
+\section{Optimized Recognizer}
+All preprocessing steps and features that were useful were combined to create a
+recognizer that performs best.
+
+All models were much better than everything that was tried before. The results
+of this experiment show that single-symbol recognition with
+\totalClassesAnalyzed{} classes and usual touch devices and the mouse can be
+done with a top-1 error rate of $\SI{18.6}{\percent}$ and a top-3 error of
+$\SI{4.1}{\percent}$. This was
+achieved by a \gls{MLP} with a $167{:}500{:}500{:}\totalClassesAnalyzed{}$ topology.
+
+It used the stroke connection algorithm to connect strokes whose ends were less
+than $\SI{10}{\pixel}$ apart, scaled each recording to a unit square and shifted
+it as described in \ref{sec:preprocessing}. After that, a linear resampling step
+was applied to the first 4 strokes to resample them to 20 points each. All
+other strokes were discarded.
+
+\goodbreak
+The 167 features were\mynobreakpar%
+\begin{itemize}
+     \item the first 4 strokes with 20 points per stroke resulting in 160
+           features,
+     \item the re-curvature for the first 4 strokes,
+     \item the ink,
+     \item the number of strokes and
+     \item the aspect ratio of the bounding box
+\end{itemize}
+
+\Gls{SLP} was applied with $\num{1000}$ epochs per layer, a
+learning rate of $\eta=0.1$ and a momentum of $\alpha=0.1$. After that, the
+complete model was trained again for $1000$ epochs with standard mini-batch
+gradient descent resulting in systems $B_{hl=1,c}'$ -- $B_{hl=4,c}'$.
+
+After the models $B_{hl=1,c}$ -- $B_{hl=4,c}$ were trained for the first $1000$ epochs,
+they were trained again for $\num{1000}$ epochs with a learning rate of $\eta =
+0.05$. \Cref{table:complex-recognizer-systems-evaluation} shows that
+this improved the classifiers again.
+
+\begin{table}[htb]
+    \centering
+    \begin{tabular}{lrrrr}
+    \toprule
+    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
+    \cmidrule(l){2-5}
+              & Top-1                 & Change                & Top-3                & Change\\\midrule
+    $B_{hl=1,c}$ & $\SI{21.0}{\percent}$ & $\SI{-2.2}{\percent}$ & $\SI{5.2}{\percent}$ & $\SI{-1.5}{\percent}$\\
+    $B_{hl=2,c}$ & $\SI{18.3}{\percent}$ & $\SI{-3.3}{\percent}$ & $\SI{4.1}{\percent}$ & $\SI{-1.6}{\percent}$\\
+    $B_{hl=3,c}$ & \underline{$\SI{18.2}{\percent}$} & $\SI{-3.7}{\percent}$ & \underline{$\SI{4.1}{\percent}$} & $\SI{-1.6}{\percent}$\\
+    $B_{hl=4,c}$ & $\SI{18.6}{\percent}$ & $\SI{-5.3}{\percent}$ & $\SI{4.3}{\percent}$ & $\SI{-1.9}{\percent}$\\\midrule
+    $B_{hl=1,c}'$ & $\SI{19.3}{\percent}$ & $\SI{-3.9}{\percent}$ & $\SI{4.8}{\percent}$ & $\SI{-1.9}{\percent}$ \\
+    $B_{hl=2,c}'$ & \underline{$\SI{17.5}{\percent}$} & $\SI{-4.1}{\percent}$ & \underline{$\SI{4.0}{\percent}$} & $\SI{-1.7}{\percent}$\\
+    $B_{hl=3,c}'$ & $\SI{17.7}{\percent}$ & $\SI{-4.2}{\percent}$ & $\SI{4.1}{\percent}$ & $\SI{-1.6}{\percent}$\\
+    $B_{hl=4,c}'$ & $\SI{17.8}{\percent}$ & $\SI{-6.1}{\percent}$ & $\SI{4.3}{\percent}$ & $\SI{-1.9}{\percent}$\\
+    \bottomrule
+    \end{tabular}
+    \caption{Error rates of the optimized recognizer systems. The systems
+             $B_{hl=i,c}'$ were trained another $\num{1000}$ epochs with a learning rate
+             of $\eta=0.05$.}
+\label{table:complex-recognizer-systems-evaluation}
+\end{table}
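The 167 features listed above (160 point coordinates, four re-curvature values, ink, stroke count and the aspect ratio of the bounding box) can be assembled as in the sketch below. It assumes strokes that are already preprocessed lists of (x, y) points with 20 resampled points per stroke; zero padding for missing strokes mirrors the baseline description, and the aspect ratio being width divided by height is an assumption.

    import math

    def stroke_length(stroke):
        """Summed length of the line segments of one stroke."""
        return sum(math.hypot(x2 - x1, y2 - y1)
                   for (x1, y1), (x2, y2) in zip(stroke, stroke[1:]))

    def re_curvature(stroke):
        """height(stroke) / length(stroke) as defined in ch4-algorithms.tex."""
        ys = [y for _, y in stroke]
        length = stroke_length(stroke)
        return (max(ys) - min(ys)) / length if length else 0.0

    def feature_vector(strokes):
        """167 features: 4 strokes x 20 points x (x, y), 4 re-curvature
        values, ink, stroke count, aspect ratio of the bounding box."""
        features = []
        # 160 coordinates; missing strokes and points are zero padded.
        for i in range(4):
            stroke = strokes[i] if i < len(strokes) else []
            for j in range(20):
                x, y = stroke[j] if j < len(stroke) else (0.0, 0.0)
                features.extend([x, y])
        # Four re-curvature values (zero for missing strokes, an assumption).
        for i in range(4):
            features.append(re_curvature(strokes[i]) if i < len(strokes) else 0.0)
        # Ink: summed length of all strokes.
        features.append(sum(stroke_length(s) for s in strokes))
        # Number of strokes.
        features.append(float(len(strokes)))
        # Aspect ratio of the bounding box (width / height, an assumption).
        xs = [x for s in strokes for x, _ in s]
        ys = [y for s in strokes for _, y in s]
        height = (max(ys) - min(ys)) or 1.0
        features.append((max(xs) - min(xs)) / height)
        return features  # len(features) == 167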

+ 32 - 0
documents/papers/write-math-paper/ch7-mfrdb-eval.tex

@@ -0,0 +1,32 @@
+%!TEX root = write-math-ba-paper.tex
+
+\section{Evaluation}
+
+The optimized classifier was evaluated on three publicly available data sets:
+\verb+MfrDB_Symbols_v1.0+ \cite{Stria2012}, CROHME~2011 \cite{Mouchere2011},
+and CROHME~2012 \cite{Mouchere2012}.
+
+\verb+MfrDB_Symbols_v1.0+ contains recordings for 105~symbols, but for
+11~symbols less than 50~recordings were available. For this reason, the
+optimized classifier was evaluated on 94~of the 105~symbols.
+
+The evaluation results are given in \cref{table:public-eval-results}.
+
+\begin{table}[htb]
+    \centering
+    \begin{tabular}{lcrr}
+    \toprule
+    \multirow{2}{*}{Dataset}  & \multirow{2}{*}{Symbols}  & \multicolumn{2}{c}{Classification error}\\
+    \cmidrule(l){3-4}
+              & & Top-1                 & Top-3                \\\midrule
+    MfrDB       & 94 & $\SI{8.4}{\percent}$  & $\SI{1.3}{\percent}$ \\
+    CROHME 2011 & 56 & $\SI{10.2}{\percent}$ & $\SI{3.7}{\percent}$ \\
+    CROHME 2012 & 75 & $\SI{12.2}{\percent}$ & $\SI{4.1}{\percent}$ \\
+    \bottomrule
+    \end{tabular}
+    \caption{Error rates of the optimized recognizer systems. The output
+             layer of each system was adjusted to the number of symbols it
+             should recognize, and the system was trained with the combined
+             data from write-math and the training data given by the datasets.}
+\label{table:public-eval-results}
+\end{table}

+ 35 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/Makefile

@@ -0,0 +1,35 @@
+SOURCE = errors-by-epoch-pretraining
+DELAY = 80
+DENSITY = 300
+WIDTH = 512
+
+make:
+	pdflatex $(SOURCE).tex -output-format=pdf
+	make clean
+
+clean:
+	rm -rf  $(TARGET) *.class *.html *.log *.aux *.data *.gnuplot
+
+gif:
+	pdfcrop $(SOURCE).pdf
+	convert -verbose -delay $(DELAY) -loop 0 -density $(DENSITY) $(SOURCE)-crop.pdf $(SOURCE).gif
+	make clean
+
+png:
+	make
+	make svg
+	inkscape $(SOURCE).svg -w $(WIDTH) --export-png=$(SOURCE).png
+
+transparentGif:
+	convert $(SOURCE).pdf -transparent white result.gif
+	make clean
+
+svg:
+	make
+	#inkscape $(SOURCE).pdf --export-plain-svg=$(SOURCE).svg
+	pdf2svg $(SOURCE).pdf $(SOURCE).svg
+	# Necessary, as pdf2svg does not always create valid svgs:
+	inkscape $(SOURCE).svg --export-plain-svg=$(SOURCE).svg
+	rsvg-convert -a -w $(WIDTH) -f svg $(SOURCE).svg -o $(SOURCE)2.svg
+	inkscape $(SOURCE)2.svg --export-plain-svg=$(SOURCE).svg
+	rm $(SOURCE)2.svg

The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-1.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2-pretraining.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-3-pretraining.csv


The diff is not shown because the file is too large.
+ 1001 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-4-pretraining.csv


+ 31 - 0
documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex

@@ -0,0 +1,31 @@
+\begin{tikzpicture}
+    \begin{axis}[
+            axis x line=middle,
+            axis y line=middle,
+            enlarge y limits=true,
+            xmin=0,
+            % xmax=1000,
+            ymin=0.18, ymax=0.4,
+            minor ytick={0, 0.01, ..., 1},
+            % width=15cm, height=8cm,     % size of the image
+            grid = both,
+            minor grid style={dashed, gray!30},
+            major grid style={gray!40},,
+            %grid style={dashed, gray!30},
+            ylabel=error,
+            xlabel=epoch,
+            legend cell align=left,
+            legend style={
+                at={(0.5,-0.1)},
+                anchor=north,
+                legend columns=2
+            }
+         ]
+          \addplot[mark=x,green] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-1.csv};
+          \addplot[mark=x,orange] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-2.csv};
+          \addplot[mark=x,red] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-2-pretraining.csv};
+          \legend{{1 hidden layer},
+                  {2 hidden layers},
+                  {2 hidden layers with pretraining}}
+    \end{axis}
+\end{tikzpicture}

+ 74 - 0
documents/papers/write-math-paper/glossary.tex

@@ -0,0 +1,74 @@
+%!TEX root = thesis.tex
+%Term definitions
+\newacronym{ANN}{ANN}{artificial neural network}
+\newacronym{CSR}{CSR}{cursive script recognition}
+\newacronym{DTW}{DTW}{dynamic time warping}
+\newacronym{GTW}{GTW}{greedy time warping}
+\newacronym{HMM}{HMM}{hidden Markov model}
+\newacronym{HWR}{HWR}{handwriting recognition}
+\newacronym{HWRT}{HWRT}{handwriting recognition toolkit}
+\newacronym{MLP}{MLP}{multilayer perceptron}
+\newacronym{MSE}{MSE}{mean squared error}
+\newacronym{OOV}{OOV}{out of vocabulary}
+\newacronym{TDNN}{TDNN}{time delay neural network}
+\newacronym{PCA}{PCA}{principal component analysis}
+\newacronym{LDA}{LDA}{linear discriminant analysis}
+\newacronym{CROHME}{CROHME}{Competition on Recognition of Online Handwritten Mathematical Expressions}
+\newacronym{GMM}{GMM}{Gaussian mixture model}
+\newacronym{SVM}{SVM}{support vector machine}
+\newacronym{PyPI}{PyPI}{Python Package Index}
+\newacronym{CFM}{CFM}{classification figure of merit}
+\newacronym{CE}{CE}{cross entropy}
+\newacronym{GPU}{GPU}{graphics processing unit}
+\newacronym{CUDA}{CUDA}{Compute Unified Device Architecture}
+\newacronym{SLP}{SLP}{supervised layer-wise pretraining}
+\newacronym{AEP}{AEP}{auto-encoder pretraining}
+
+% Term definitions
+\newglossaryentry{Detexify}{name={Detexify}, description={A system used for
+on-line handwritten symbol recognition which is described in \cite{Kirsch}}}
+
+\newglossaryentry{epoch}{name={epoch}, description={During iterative training of a neural network, an \textit{epoch} is a single pass through the entire training set, followed by testing of the verification set.\cite{Concise12}}}
+
+\newglossaryentry{hypothesis}{
+    name={hypothesis},
+    description={The recognition result which a classifier returns is called a hypothesis. In other words, it is the \enquote{guess} of a classifier},
+    plural=hypotheses
+}
+
+\newglossaryentry{reference}{
+    name={reference},
+    description={Labeled data is used to evaluate classifiers. Those labels are called references},
+}
+
+\newglossaryentry{YAML}{name={YAML}, description={YAML is a human-readable data format that can be used for configuration files}}
+\newglossaryentry{MER}{name={MER}, description={An error measure which combines symbols to equivalence classes. It was introduced on \cpageref{merged-error-introduction}}}
+
+\newglossaryentry{JSON}{name={JSON}, description={JSON, short for JavaScript Object Notation, is a language-independent data format that can be used to transmit data between a server and a client in web applications}}
+
+\newglossaryentry{hyperparamter}{name={hyperparameter}, description={A
+\textit{hyperparameter} is a parameter of a neural net that cannot be learned,
+but has to be chosen}, symbol={\ensuremath{\theta}}}
+
+\newglossaryentry{learning rate}{name={learning rate}, description={A factor $0 \leq \eta \in \mdr$ that affects how fast new weights are learned. $\eta=0$ means that no new data is learned}, symbol={\ensuremath{\eta}}} % Andrew Ng: \alpha
+
+\newglossaryentry{learning rate decay}{name={learning rate decay}, description={The learning rate decay $0 < \alpha \leq 1$ is used to adjust the learning rate. After each epoch the learning rate $\eta$ is updated to $\eta \gets \eta \times \alpha$}, symbol={\ensuremath{\eta}}}
+
+\newglossaryentry{preactivation}{name={preactivation}, description={The preactivation of a neuron is the weighted sum of its input, before the activation function is applied}}
+
+\newglossaryentry{stroke}{name={stroke}, description={The path the pen took from
+the point where the pen was put down to the point where the pen was lifted first}}
+
+\newglossaryentry{line}{name={line}, description={Geometric object that is infinitely long
+and defined by two points.}}
+
+\newglossaryentry{line segment}{name={line segment}, description={Geometric object that has finite length
+and is defined by two points.}}
+
+\newglossaryentry{symbol}{name={symbol}, description={An atomic semantic entity. A more detailed description can be found in \cref{sec:what-is-a-symbol}}}
+
+\newglossaryentry{weight}{name={weight}, description={A
+\textit{weight} is a parameter of a neural net that can be learned}, symbol={\ensuremath{\weight}}}
+
+\newglossaryentry{control point}{name={control point}, description={A
+\textit{control point} is a point recorded by the input device.}}

BIN
documents/papers/write-math-paper/sRGBIEC1966-2.1.icm


+ 12 - 0
documents/papers/write-math-paper/variables.tex

@@ -0,0 +1,12 @@
+\newcommand{\totalCollectedRecordings}{166898}  % ACTUALITY
+\newcommand{\detexifyCollectedRecordings}{153423}
+\newcommand{\trainingsetsize}{134804}
+\newcommand{\validtionsetsize}{15161}
+\newcommand{\testsetsize}{17012}
+\newcommand{\totalClasses}{1111}
+\newcommand{\totalClassesAnalyzed}{369}
+\newcommand{\totalClassesAboveFifty}{680}
+\newcommand{\totalClassesNotAnalyzedBelowFifty}{431}
+\newcommand{\detexifyPercentage}{$\SI{91.93}{\percent}$}
+\newcommand{\recordingsWithDots}{$\SI{2.77}{\percent}$}  % excluding i,j, ...
+\newcommand{\recordingsWithDotsSizechange}{$\SI{0.85}{\percent}$}  % excluding i,j, ...

The diff is not shown because the file is too large.
+ 1750 - 0
documents/papers/write-math-paper/write-math-ba-paper.bib


+ 87 - 0
documents/papers/write-math-paper/write-math-ba-paper.tex

@@ -0,0 +1,87 @@
+\documentclass[9pt,technote,a4paper]{IEEEtran}
+\usepackage{amssymb, amsmath} % needed for math
+
+\usepackage[a-1b]{pdfx}
+\usepackage{filecontents}
+\begin{filecontents*}{\jobname.xmpdata}
+    \Keywords{recognition\sep machine learning\sep neural networks\sep symbols\sep multilayer perceptron}
+    \Title{On-line Recognition of Handwritten Mathematical Symbols}
+    \Author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}
+    \Org{Institute for Anthropomatics and Robotics}
+    \Doi{}
+\end{filecontents*}
+
+\RequirePackage{ifpdf}
+\ifpdf \PassOptionsToPackage{pdfpagelabels}{hyperref} \fi
+\RequirePackage{hyperref}
+\usepackage{parskip}
+\usepackage[pdftex,final]{graphicx}
+\usepackage{csquotes}
+\usepackage{braket}
+\usepackage{booktabs}
+\usepackage{multirow}
+\usepackage{pgfplots}
+\usepackage{wasysym}
+\usepackage{caption}
+% \captionsetup{belowskip=12pt,aboveskip=4pt}
+\makeatletter
+\newcommand\mynobreakpar{\par\nobreak\@afterheading}
+\makeatother
+\usepackage[noadjust]{cite}
+\usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
+\usepackage[binary-units,group-separator={,}]{siunitx}
+\sisetup{per-mode=fraction,binary-units=true}
+\DeclareSIUnit\pixel{px}
+\usepackage{glossaries}
+\loadglsentries[main]{glossary}
+\makeglossaries
+
+\title{On-line Recognition of Handwritten Mathematical Symbols}
+\author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}
+
+\hypersetup{
+  pdfauthor   = {Martin Thoma\sep Kevin Kilgour\sep Sebastian St{\"u}ker\sep Alexander Waibel},
+  pdfkeywords = {recognition\sep machine learning\sep neural networks\sep symbols\sep multilayer perceptron},
+  pdfsubject  = {Recognition},
+  pdftitle    = {On-line Recognition of Handwritten Mathematical Symbols},
+}
+\include{variables}
+\crefname{table}{Table}{Tables}
+\crefname{figure}{Figure}{Figures}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Begin document                                                    %
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{document}
+\maketitle
+\begin{abstract}
+The automatic recognition of single handwritten symbols has three main
+applications: supporting users who know what a symbol looks like, but not what
+its name is, providing the necessary commands for professional publishing, or
+as a building block for formula recognition.
+
+This paper presents a system which uses the pen trajectory to classify
+handwritten symbols. Five preprocessing steps, one data multiplication
+algorithm, five features and five variants for multilayer Perceptron training
+were evaluated using $\num{166898}$ recordings. Those recordings were made
+publicly available. The evaluation results of these 21~experiments were used to
+create an optimized recognizer which has a top-1 error of less than
+$\SI{17.5}{\percent}$ and a top-3 error of $\SI{4.0}{\percent}$. This is a
+relative improvement of $\SI{18.5}{\percent}$ for the top-1 error and
+$\SI{29.7}{\percent}$ for the top-3 error compared to the baseline system. This
+improvement was achieved by \acrlong{SLP} and adding new features. The
+improved classifier can be used via \href{http://write-math.com/}{write-math.com}.
+\end{abstract}
+
+\input{ch1-introduction}
+\input{ch2-general-system-design}
+\input{ch3-data-and-implementation}
+\input{ch4-algorithms}
+\input{ch5-optimization-of-system-design}
+\input{ch6-summary}
+\input{ch7-mfrdb-eval}
+
+
+\bibliographystyle{IEEEtranSA}
+\bibliography{write-math-ba-paper}
+\end{document}