\documentclass[9pt,technote]{IEEEtran}
\usepackage{amssymb, amsmath} % needed for math
\usepackage{hyperref} % links in the text
\usepackage{parskip}
\usepackage[pdftex,final]{graphicx}
\usepackage{csquotes}
\usepackage{braket}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{pgfplots}
\usepackage[noadjust]{cite}
\usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
\usepackage[binary-units]{siunitx}
\sisetup{per-mode=fraction,binary-units=true}
\DeclareSIUnit\pixel{px}
\usepackage{glossaries}
\loadglsentries[main]{glossary}
\makeglossaries
\title{On-line Recognition of Handwritten Mathematical Symbols}
\author{Martin Thoma and Kevin Kilgour}
\hypersetup{
    pdfauthor   = {Martin Thoma},
    pdfkeywords = {Mathematics,Symbols,recognition},
    pdftitle    = {On-line Recognition of Handwritten Mathematical Symbols}
}
\include{variables}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Begin document                                                    %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\begin{abstract}
Writing mathematical formulas with \LaTeX{} is easy as soon as one is used to
commands like \verb+\alpha+ and \verb+\propto+. However, for people who have
never used \LaTeX{} or who don't know the English name of the command, it can
be difficult to find the right command. Hence the automatic recognition of
handwritten mathematical symbols is desirable. This paper presents a system
which uses the pen trajectory to classify handwritten symbols. Five
preprocessing steps, one data multiplication algorithm, five features and five
variants for multilayer perceptron training were evaluated using $\num{166898}$
recordings which were collected with two crowdsourcing projects. The evaluation
results of these 21~experiments were used to create an optimized recognizer
which has a TOP-1 error of less than $\SI{17.5}{\percent}$ and a TOP-3 error of
$\SI{4.0}{\percent}$. This is a relative improvement of $\SI{18.5}{\percent}$ for the
TOP-1 error and $\SI{29.7}{\percent}$ for the TOP-3 error compared to the
baseline system.
\end{abstract}
\section{Introduction}
On-line recognition makes use of the pen trajectory. This means the data is
given as groups of sequences of tuples $(x, y, t) \in \mathbb{R}^3$, where each
group represents a stroke, $(x, y)$ is the position of the pen on a canvas and
$t$ is the time. One handwritten symbol in the described format is called a
\textit{recording}. One approach to classify recordings into symbol classes
assigns a probability to each class given the data. The classifier can be
evaluated by using recordings which were classified by humans and were not used
to train the classifier. The set of those recordings is called the \textit{test
set}. The TOP-$n$ error is defined as the fraction of the recordings for which
the correct class was not among the $n$ classes with the highest probability.

Various systems for mathematical symbol recognition with on-line data have been
described so far~\cite{Kosmala98,Mouchere2013}, but most of them have published
neither their source code nor their data, which makes it impossible to re-run
their experiments and compare different systems. This is unfortunate as the
choice of symbols is crucial for the TOP-$n$ error and all systems used
different symbol sets. For example, the symbols $o$, $O$, $\circ$ and $0$ are
very similar, and systems which know all of those classes will certainly have a
higher TOP-$n$ error than systems which only accept one of them.

Daniel Kirsch describes in~\cite{Kirsch} a system called Detexify which uses
time warping to classify on-line handwritten symbols and claims to achieve a
TOP-3 error of less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols.
He also published his data on \url{https://github.com/kirel/detexify-data},
which was collected by a crowdsourcing approach via
\url{http://detexify.kirelabs.org}. Those recordings, as well as some recordings
which were collected by a similar approach via \url{http://write-math.com}, were
used to train and evaluate different classifiers. A complete description of
all involved software, data and experiments is given in~\cite{Thoma:2014}.
\section{Steps in Handwriting Recognition}
The following steps are used in many classifiers:
\begin{enumerate}
    \item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
          errors and people make mistakes while using the devices. To tackle
          these problems there are preprocessing algorithms to clean the data.
          The preprocessing algorithms can also remove unnecessary variations
          of the data that do not help in the classification process, but hide
          what is important. Having slightly different sizes of the same symbol
          is an example of such a variation. Four preprocessing algorithms that
          clean or normalize recordings are explained in
          \cref{sec:preprocessing}.
    \item \textbf{Data multiplication}: Learning algorithms need lots of data
          to learn internal parameters. If there is not enough data available,
          domain knowledge can be considered to create new artificial data from
          the original data. In the domain of on-line handwriting recognition,
          data can be multiplied by adding rotated variants.
    \item \textbf{Segmentation}: The task of formula recognition can eventually
          be reduced to the task of symbol recognition combined with symbol
          placement. Before symbol recognition can be done, the formula has
          to be segmented. As this paper is only about single-symbol
          recognition, this step will not be further discussed.
    \item \textbf{Feature computation}: A feature is high-level information
          derived from the raw data after preprocessing. Some systems like
          Detexify simply take the result of the preprocessing step, but many
          compute new features. This might have the advantage that less
          training data is needed since the developer can use knowledge about
          handwriting to compute highly discriminative features. Various
          features are explained in \cref{sec:features}.
    \item \textbf{Feature enhancement}: Applying PCA, LDA, or
          feature standardization might change the features in ways that could
          improve the performance of learning algorithms.
\end{enumerate}

After these steps, we are faced with a classification learning task which
consists of two parts:
\begin{enumerate}
    \item \textbf{Learning} parameters for a given classifier. This process is
          also called \textit{training}.
    \item \textbf{Classifying} new recordings, sometimes called
          \textit{evaluation}. This should not be confused with the evaluation
          of the classification performance which is done for multiple
          topologies, preprocessing queues, and features in
          \Cref{ch:Evaluation}.
\end{enumerate}

The classification learning task can be solved with \glspl{MLP} if the number
of input features is the same for every recording. There are many ways to
adjust \glspl{MLP} and their training. Some of them are described
in~\cref{sec:mlp-training}.
\section{Algorithms}
\subsection{Preprocessing}\label{sec:preprocessing}
Preprocessing in symbol recognition is done to improve the quality and
expressive power of the data. It should make follow-up tasks like segmentation
and feature extraction easier, more effective or faster. It does so by resolving
errors in the input data, reducing duplicate information and removing irrelevant
information.

Preprocessing algorithms fall into two groups: Normalization and noise
reduction algorithms.

A very important normalization algorithm in single-symbol recognition is
\textit{scale-and-shift}~\cite{Thoma:2014}. It scales the recording so that
its bounding box fits into a unit square. As the aspect ratio of a recording
is almost never 1:1, only one dimension will fit the unit square exactly.
There are multiple ways to shift the recording. For this paper, it was
chosen to shift the bigger dimension to fit into the $[0,1] \times [0,1]$ unit
square whereas the smaller dimension is centered in the $[-1,1] \times [-1,1]$
square.
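
A minimal Python sketch of this scale-and-shift step is given below. It assumes
that a recording is represented as a list of strokes, where each stroke is a
list of $(x, y, t)$ tuples; the data format and the function name are
illustrative and not necessarily those of the software used for the
experiments.

\begin{verbatim}
def scale_and_shift(strokes):
    """Scale a recording so that its bounding box fits a unit
    square, keeping the aspect ratio. The bigger dimension is
    shifted to [0, 1], the smaller one is centered around 0."""
    xs = [x for stroke in strokes for (x, y, t) in stroke]
    ys = [y for stroke in strokes for (x, y, t) in stroke]
    x_min, y_min = min(xs), min(ys)
    width, height = max(xs) - x_min, max(ys) - y_min
    factor = 1.0 / max(width, height, 10 ** -9)  # guard for dots
    dx = (width * factor) / 2.0 if height > width else 0.0
    dy = (height * factor) / 2.0 if width >= height else 0.0
    return [[((x - x_min) * factor - dx,
              (y - y_min) * factor - dy,
              t) for (x, y, t) in stroke]
            for stroke in strokes]
\end{verbatim}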

Another normalization preprocessing algorithm is resampling. As the data points
on the pen trajectory are generated asynchronously and with different
time resolutions depending on the hardware and software used, it is desirable
to resample the recordings to have points spread equally in time for every
recording. This was done by linear interpolation of the $(x,t)$ and $(y,t)$
sequences and taking a fixed number of equally spaced points per stroke.
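
With NumPy, such a resampling step could be sketched as follows (again assuming
the $(x, y, t)$ tuple format from above; the number of points per stroke is a
parameter):

\begin{verbatim}
import numpy as np

def resample_stroke(stroke, n_points=20):
    """Resample one stroke to n_points points, spread equally in
    time, by linear interpolation of (t, x) and (t, y)."""
    ts = np.array([t for (x, y, t) in stroke])
    xs = np.array([x for (x, y, t) in stroke])
    ys = np.array([y for (x, y, t) in stroke])
    ts_new = np.linspace(ts[0], ts[-1], n_points)
    return list(zip(np.interp(ts_new, ts, xs),
                    np.interp(ts_new, ts, ys),
                    ts_new))
\end{verbatim}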

\textit{Connect strokes} is a noise reduction algorithm. It sometimes happens
that the hardware detects that the user lifted the pen although the user
certainly did not do so. This can be detected by measuring the Euclidean
distance between the end of one stroke and the beginning of the next stroke.
If this distance is below a threshold, then the strokes are connected.
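
A sketch of this noise reduction step (with the threshold in the units of the
raw recording, e.g.\ pixels) might look like this:

\begin{verbatim}
import math

def connect_strokes(strokes, threshold=10.0):
    """Join consecutive strokes whose gap is below the threshold."""
    if not strokes:
        return strokes
    connected = [strokes[0]]
    for stroke in strokes[1:]:
        x1, y1, _ = connected[-1][-1]  # end of the previous stroke
        x2, y2, _ = stroke[0]          # start of the current stroke
        if math.hypot(x2 - x1, y2 - y1) < threshold:
            connected[-1] = connected[-1] + stroke
        else:
            connected.append(stroke)
    return connected
\end{verbatim}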

Due to the limited resolution of the recording device and due to erratic
handwriting, the pen trajectory might not be smooth. One way to smooth it is
to replace each point by a weighted average of its own coordinates and its
neighbors' coordinates. Another way to do smoothing would be to reduce the
number of points with the Douglas-Peucker algorithm to the most relevant ones
and then interpolate the stroke between those points.
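
The weighted-average variant can be sketched in a few lines; the concrete
weights below are an assumption for illustration, not the values used in the
experiments:

\begin{verbatim}
def smooth(stroke, weights=(1, 4, 1)):
    """Replace every inner point by a weighted average of itself
    and its two neighbors."""
    w_prev, w_self, w_next = weights
    total = w_prev + w_self + w_next
    smoothed = [stroke[0]]
    for i in range(1, len(stroke) - 1):
        x = (w_prev * stroke[i - 1][0] + w_self * stroke[i][0]
             + w_next * stroke[i + 1][0]) / total
        y = (w_prev * stroke[i - 1][1] + w_self * stroke[i][1]
             + w_next * stroke[i + 1][1]) / total
        smoothed.append((x, y, stroke[i][2]))
    smoothed.append(stroke[-1])
    return smoothed
\end{verbatim}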

The Douglas-Peucker stroke simplification algorithm is usually used in
cartography to simplify the shape of roads. It works recursively to find a
subset of points of a stroke that is simpler and still similar to the original
shape. The algorithm adds the first and the last point $p_1$ and $p_n$ of a
stroke to the simplified set of points $S$. Then it searches the point $p_i$ in
between that has maximum distance from the line $p_1 p_n$. If this
distance is above a threshold $\varepsilon$, the point $p_i$ is added to $S$.
Then the algorithm gets applied to $p_1 p_i$ and $p_i p_n$ recursively. It is
described as \enquote{Algorithm 1} in~\cite{Visvalingam1990}.
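
The recursion just described can be written down directly; the following sketch
operates on a list of $(x, y)$ points of a single stroke:

\begin{verbatim}
import math

def douglas_peucker(points, epsilon):
    """Return a simplified subset of the points of one stroke."""
    def distance_to_line(p, start, end):
        (x, y), (x1, y1), (x2, y2) = p, start, end
        if (x1, y1) == (x2, y2):
            return math.hypot(x - x1, y - y1)
        return (abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1))
                / math.hypot(x2 - x1, y2 - y1))

    # find the in-between point farthest from the line p_1 p_n
    d_max, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = distance_to_line(points[i], points[0], points[-1])
        if d > d_max:
            d_max, index = d, i
    if d_max > epsilon:
        # keep that point and simplify both halves recursively
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]
\end{verbatim}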
\subsection{Features}\label{sec:features}
Features can be \textit{global}, that means calculated for the complete
recording or complete strokes. Other features are calculated for single points
on the pen trajectory and are called \textit{local}.

Global features are the \textit{number of strokes} in a recording, the
\textit{aspect ratio} of a recording's bounding box or the
\textit{ink} being used for a recording. The ink feature is calculated by
measuring the length of all strokes combined. The re-curvature, which was
introduced in~\cite{Huang06}, is defined as
\[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
and is a stroke-global feature.
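
The following sketch shows how these global features could be computed for the
$(x, y, t)$ stroke format used above; the function names and the handling of
degenerate recordings such as single dots are illustrative:

\begin{verbatim}
import math

def stroke_length(stroke):
    return sum(math.hypot(stroke[i + 1][0] - stroke[i][0],
                          stroke[i + 1][1] - stroke[i][1])
               for i in range(len(stroke) - 1))

def re_curvature(stroke):
    """re-curvature(stroke) = height(stroke) / length(stroke)"""
    ys = [y for (x, y, t) in stroke]
    length = stroke_length(stroke)
    return (max(ys) - min(ys)) / length if length > 0 else 0.0

def global_features(strokes):
    xs = [x for s in strokes for (x, y, t) in s]
    ys = [y for s in strokes for (x, y, t) in s]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return {"stroke count": len(strokes),
            "aspect ratio": width / height if height > 0 else 0.0,
            "ink": sum(stroke_length(s) for s in strokes)}
\end{verbatim}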

The simplest local feature is the coordinate of the point itself. Speed,
curvature and a local small-resolution bitmap around the point, which was
introduced by Manke, Finke and Waibel in~\cite{Manke1995}, are other local
features.
\subsection{Multilayer Perceptrons}\label{sec:mlp-training}
\Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
different numbers of hidden layers; the number of neurons per layer and the
activation functions can also be varied. The learning algorithm is
parameterized by the learning rate $\eta \in (0, \infty)$, the momentum
$\alpha \in [0, \infty)$ and the number of epochs.

The topology of \glspl{MLP} will be denoted in the following by separating the
number of neurons per layer with colons. For example, the notation
$160{:}500{:}500{:}500{:}369$ means that the input layer gets 160~features,
there are three hidden layers with 500~neurons per layer and one output layer
with 369~neurons.

\gls{MLP} training can be executed in various ways, for example
with \gls{SLP}. In case of a \gls{MLP} with the topology
$160{:}500{:}500{:}500{:}369$, \gls{SLP} works as follows: At first a \gls{MLP}
with one hidden layer ($160{:}500{:}369$) is trained. Then the output layer is
discarded, a new hidden layer and a new output layer are added, and the network
is trained again, resulting in a $160{:}500{:}500{:}369$ \gls{MLP}. The output
layer is discarded again, a new hidden layer and a new output layer are added,
and the training is executed once more.
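
The following sketch illustrates this procedure with Keras, which is not the
toolkit used for the experiments in this paper; the batch size and the
assumption of one-hot encoded labels are only there to make the example
self-contained:

\begin{verbatim}
from tensorflow import keras

def train_slp(x_train, y_train, n_features=160, n_classes=369,
              hidden_layers=3, neurons=500, epochs=1000):
    """Supervised layer-wise pretraining for an MLP such as
    160:500:500:500:369."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(neurons, activation="sigmoid"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    for layer_nr in range(hidden_layers):
        sgd = keras.optimizers.SGD(learning_rate=0.1, momentum=0.1)
        model.compile(optimizer=sgd,
                      loss="categorical_crossentropy")
        model.fit(x_train, y_train, epochs=epochs, batch_size=256)
        if layer_nr < hidden_layers - 1:
            model.pop()  # discard the output layer
            model.add(keras.layers.Dense(neurons,
                                         activation="sigmoid"))
            model.add(keras.layers.Dense(n_classes,
                                         activation="softmax"))
    return model
\end{verbatim}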

Denoising auto-encoders are another way of pretraining. An
\textit{auto-encoder} is a neural network that is trained to restore its input.
This means the number of input neurons is equal to the number of output
neurons. The weights define an \textit{encoding} of the input that allows
restoring the input. As the neural network finds the encoding by itself, it is
called an auto-encoder. If the hidden layer is smaller than the input layer, it
can be used for dimensionality reduction~\cite{Hinton1989}. If only one hidden
layer with linear activation functions is used, then the hidden layer contains
the principal components after training~\cite{Duda2001}.

Denoising auto-encoders are a variant introduced in~\cite{Vincent2008} that
is more robust to partial corruption of the input features. The robustness is
achieved by adding noise to the input features during training.

There are multiple ways to add noise. Gaussian noise and
randomly masking elements with zero are two possibilities.
\cite{Deeplearning-Denoising-AE} describes how such a denoising auto-encoder
with masking noise can be implemented. The \texttt{corruption} is the
probability of a feature being masked.
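
Masking noise itself is simple to implement; a NumPy sketch (the function name
is illustrative) is:

\begin{verbatim}
import numpy as np

def mask_noise(features, corruption=0.3,
               rng=np.random.default_rng()):
    """Set each feature to 0 with probability corruption."""
    mask = rng.random(features.shape) >= corruption
    return features * mask
\end{verbatim}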
\section{Evaluation}\label{ch:Evaluation}
In order to evaluate the effect of different preprocessing algorithms, features
and adjustments in the \gls{MLP} training and topology, the following baseline
system was used:

Scale the recording to fit into a unit square while keeping the aspect ratio,
shift it into $[-1,1] \times [-1,1]$ as described in \cref{sec:preprocessing},
and resample it with linear interpolation to get 20~points per stroke, spaced
evenly in time. Take the first 4~strokes with 20~points per stroke and
2~coordinates per point as features, resulting in 160~features, which is equal
to the number of input neurons. If a recording has fewer than 4~strokes, the
remaining features are filled with zeros.
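
Using the \texttt{resample\_stroke} sketch from \cref{sec:preprocessing}, the
construction of this 160-dimensional feature vector can be outlined as follows
(the function and parameter names are again illustrative):

\begin{verbatim}
def baseline_features(strokes, n_strokes=4, points_per_stroke=20):
    """First 4 strokes, 20 resampled points each, 2 coordinates
    per point; missing strokes are padded with zeros."""
    features = []
    for i in range(n_strokes):
        if i < len(strokes):
            pts = resample_stroke(strokes[i], points_per_stroke)
            for (x, y, t) in pts:
                features.extend([x, y])
        else:
            features.extend([0.0] * (2 * points_per_stroke))
    return features
\end{verbatim}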

All experiments were evaluated with four baseline systems $B_i$, $i \in \Set{1,
2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
could have a severe influence on the effect of new features or preprocessing
steps. Each hidden layer in all evaluated systems has $500$ neurons.

Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
the sigmoid function $\text{sig}(x) := \frac{1}{1+e^{-x}}$. The neurons in the
output layer use the softmax function. For every experiment, exactly one part
of the baseline systems was changed.
\subsection{Random Weight Initialization}
The neural networks in all experiments were initialized with small random
weights
\[w_{i,j} \sim U\left(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}},\; 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}\right)\]
where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
initialization was suggested in~\cite{deeplearningweights} and is done to break
symmetry. It might lead to different error rates for the same systems just
because the initialization was different.
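
A direct NumPy translation of this initialization for one weight matrix between
a layer with \texttt{n\_in} neurons and a layer with \texttt{n\_out} neurons
could look like this:

\begin{verbatim}
import numpy as np

def init_weights(n_in, n_out, rng=np.random.default_rng()):
    """w ~ U(-r, r) with r = 4 * sqrt(6 / (n_in + n_out))."""
    r = 4.0 * np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_in, n_out))
\end{verbatim}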

In order to get an impression of the magnitude of this influence on the
different topologies and error rates, the baseline models were trained 5~times
with different random initializations.
\Cref{table:baseline-systems-random-initializations-summary}
shows a summary of the results. The more hidden layers are used, the more the
results vary between different random weight initializations.

\begin{table}[h]
    \centering
    \begin{tabular}{crrr|rrr} %chktex 44
    \toprule
    \multirow{3}{*}{System} & \multicolumn{6}{c}{Classification error}\\
    \cmidrule(l){2-7}
     & \multicolumn{3}{c}{TOP-1} & \multicolumn{3}{c}{TOP-3}\\
     & min & max & range & min & max & range\\\midrule
    $B_1$ & $\SI{23.08}{\percent}$ & $\SI{23.44}{\percent}$ & $\SI{0.36}{\percent}$ & $\SI{6.67}{\percent}$ & $\SI{6.80}{\percent}$ & $\SI{0.13}{\percent}$ \\
    $B_2$ & \underline{$\SI{21.45}{\percent}$} & \underline{$\SI{21.83}{\percent}$} & $\SI{0.38}{\percent}$ & $\SI{5.68}{\percent}$ & \underline{$\SI{5.75}{\percent}$} & $\SI{0.07}{\percent}$\\
    $B_3$ & $\SI{21.54}{\percent}$ & $\SI{22.28}{\percent}$ & $\SI{0.74}{\percent}$ & \underline{$\SI{5.50}{\percent}$} & $\SI{5.82}{\percent}$ & $\SI{0.32}{\percent}$\\
    $B_4$ & $\SI{23.19}{\percent}$ & $\SI{24.84}{\percent}$ & $\SI{1.65}{\percent}$ & $\SI{5.98}{\percent}$ & $\SI{6.44}{\percent}$ & $\SI{0.46}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{The systems $B_1$--$B_4$ were randomly initialized, trained
             and evaluated 5~times to estimate the influence of random weight
             initialization.}
    \label{table:baseline-systems-random-initializations-summary}
\end{table}
\subsection{Connect Strokes}
In order to solve the problem of interrupted strokes, pairs of strokes
can be connected with the stroke connect algorithm. The idea is that for
a pair of consecutively drawn strokes $s_{i}, s_{i+1}$ the last point of $s_i$
is close to the first point of $s_{i+1}$ if a stroke was accidentally split
into two strokes.

$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connect
algorithm was tried with thresholds of $\SI{5}{\pixel}$, $\SI{10}{\pixel}$ and
$\SI{20}{\pixel}$, which are well below that range and therefore unlikely to
merge strokes that were separated on purpose.

The TOP-3 error of all models improved with a threshold of
$\theta = \SI{10}{\pixel}$ by at least $\SI{0.17}{\percent}$, except for $B_4$,
which improved only by $\SI{0.01}{\percent}$; this could be a result of random
weight initialization.
\subsection{Douglas-Peucker Smoothing}
The Douglas-Peucker algorithm can be used to find
points that are more relevant for the overall shape of a recording. After that,
an interpolation can be done. If the interpolation is a cubic spline
interpolation, this makes the recording smooth.

The Douglas-Peucker algorithm was applied with thresholds of $\varepsilon =
0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
but before resampling. The interpolation in the resampling step was done
linearly in one set of experiments and with cubic splines in another. The
recording was scaled and shifted again after the interpolation because the
bounding box might have changed.

The result of applying Douglas-Peucker smoothing with $\varepsilon
> 0.05$ was a sharp rise of the TOP-1 and TOP-3 error for all models $B_i$.
This means that the simplification process removes some relevant information
and does not, as was expected, remove only noise. For $\varepsilon = 0.05$
with linear interpolation the TOP-1 error of some models improved, but the
changes were small and could be an effect of random weight initialization.
However, cubic spline interpolation made all systems perform more than
$\SI{1.7}{\percent}$ worse for the TOP-1 and TOP-3 error.

The lower the value of $\varepsilon$, the less the recording changes through
this preprocessing step. As it was applied after scaling the recording such that
the biggest dimension of the recording (width or height) is $1$, a value of
$\varepsilon = 0.05$ means that a point is only removed if it is less than
$\SI{5}{\percent}$ of the biggest dimension away from the approximating line.
\subsection{Global Features}
Single global features were added one at a time to the baseline systems. Those
features were the re-curvature $\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
as described in \cite{Huang06}, the ink feature, which is the summed length
of all strokes, the stroke count, the aspect ratio and the stroke center points
for the first four strokes. The stroke center point feature improved the system
$B_1$ by $\SI{0.27}{\percent}$ for the TOP-3 error and the system $B_3$ for the
TOP-1 error by $\SI{0.74}{\percent}$, but all other systems and error measures
either got worse or did not improve much.

The other global features did improve the systems $B_1$--$B_3$, but not $B_4$.
The highest improvement was achieved with the re-curvature feature. It
improved the systems $B_1$--$B_4$ by more than $\SI{0.6}{\percent}$ in the
TOP-1 error.
\subsection{Data Multiplication}
Data multiplication can be used to make the model invariant to transformations.
However, this idea seems not to work well in the domain of on-line handwritten
mathematical symbols. The data was tripled by adding, for each recording, one
variant rotated 3~degrees to the left and one rotated 3~degrees to the right
around its center of mass. This data multiplication made all classifiers
perform worse for most error measures, with TOP-1 errors increasing by more
than $\SI{2}{\percent}$.
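
For completeness, the rotation around the center of mass that creates such
variants can be sketched as follows (again for the $(x, y, t)$ stroke format;
the function name is illustrative):

\begin{verbatim}
import math

def rotate(strokes, degrees):
    """Rotate a recording around its center of mass."""
    points = [(x, y) for s in strokes for (x, y, t) in s]
    cx = sum(x for x, y in points) / len(points)
    cy = sum(y for x, y in points) / len(points)
    phi = math.radians(degrees)
    c, s = math.cos(phi), math.sin(phi)
    return [[(cx + (x - cx) * c - (y - cy) * s,
              cy + (x - cx) * s + (y - cy) * c,
              t) for (x, y, t) in stroke]
            for stroke in strokes]

# tripling the training data:
# augmented = (recordings
#              + [rotate(r, -3) for r in recordings]
#              + [rotate(r, +3) for r in recordings])
\end{verbatim}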
\subsection{Pretraining}\label{subsec:pretraining-evaluation}
Pretraining is a technique used to improve the training of \glspl{MLP} with
multiple hidden layers.

\Cref{fig:training-and-test-error-for-different-topologies-pretraining} shows
the evolution of the TOP-1 error over 1000~epochs with supervised
layer-wise pretraining and without pretraining. It clearly shows that this
kind of pretraining improves the classification performance by $\SI{1.6}{\percent}$
for the TOP-1 error and $\SI{1.0}{\percent}$ for the TOP-3 error.

\begin{figure}[htb]
    \centering
    \input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
    \caption{Training and test error by number of trained epochs for different
             topologies with \gls{SLP}. The plot shows
             that all pretrained systems performed much better than the systems
             without pretraining. None of the plotted systems improved
             with more epochs of training.}
    \label{fig:training-and-test-error-for-different-topologies-pretraining}
\end{figure}

Pretraining with denoising auto-encoders led to the much worse results listed
in \cref{table:pretraining-denoising-auto-encoder}. The first layer used a
$\tanh$ activation function. Every layer was trained for $1000$ epochs with the
\gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
$0.3$ and an $L_2$ regularization of $\lambda = 10^{-4}$ were chosen. This
pretraining setup made all systems perform much worse on all error measures.

\begin{table}[tb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
     & TOP-1 & change & TOP-3 & change \\\midrule
    $B_{1,p}$ & $\SI{23.75}{\percent}$ & $\SI{+0.41}{\percent}$ & $\SI{7.19}{\percent}$ & $\SI{+0.39}{\percent}$\\
    $B_{2,p}$ & \underline{$\SI{22.76}{\percent}$} & $\SI{+1.25}{\percent}$ & $\SI{6.38}{\percent}$ & $\SI{+0.63}{\percent}$\\
    $B_{3,p}$ & $\SI{23.10}{\percent}$ & $\SI{+1.17}{\percent}$ & \underline{$\SI{6.14}{\percent}$} & $\SI{+0.40}{\percent}$\\
    $B_{4,p}$ & $\SI{25.59}{\percent}$ & $\SI{+1.71}{\percent}$ & $\SI{6.99}{\percent}$ & $\SI{+0.87}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Systems with denoising auto-encoder pretraining compared to pure
             gradient descent. The pretrained systems clearly performed worse.}
    \label{table:pretraining-denoising-auto-encoder}
\end{table}
\subsection{Optimized Recognizer}
All preprocessing steps and features that proved useful were combined to
create a recognizer that should perform best.

All of these models performed much better than everything that was tried
before. The results of this experiment show that single-symbol recognition with
\totalClassesAnalyzed{} classes and usual touch devices and the mouse can be
done with a TOP-1 error rate of $\SI{18.56}{\percent}$ and a TOP-3 error of
$\SI{4.11}{\percent}$. This was
achieved by a \gls{MLP} with a $167{:}500{:}500{:}\totalClassesAnalyzed{}$ topology.
It used an algorithm to connect strokes whose ends were less than
$\SI{10}{\pixel}$ apart, scaled each recording to a unit square and shifted
this unit square to $(0,0)$. After that, a linear resampling step was applied
to the first 4~strokes to resample them to 20~points each. All other strokes
were discarded.

The 167 features were
\begin{itemize}
    \item the first 4 strokes with 20 points per stroke, resulting in 160
          features,
    \item the re-curvature for the first 4 strokes,
    \item the ink,
    \item the number of strokes and
    \item the aspect ratio.
\end{itemize}

\Gls{SLP} was applied with $\num{1000}$ epochs per layer, a
learning rate of $\eta=0.1$ and a momentum of $\alpha=0.1$. After that, the
complete model was trained again for $1000$ epochs with standard mini-batch
gradient descent.

After the models $B_{1,c}$--$B_{4,c}$ were trained for the first $1000$ epochs,
they were trained again for $1000$ epochs with a learning rate of $\eta = 0.05$.
\Cref{table:complex-recognizer-systems-evaluation} shows that
this improved the classifiers again.

\begin{table}[htb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
     & TOP-1 & change & TOP-3 & change\\\midrule
    $B_{1,c}$ & $\SI{20.96}{\percent}$ & $\SI{-2.38}{\percent}$ & $\SI{5.24}{\percent}$ & $\SI{-1.56}{\percent}$\\
    $B_{2,c}$ & $\SI{18.26}{\percent}$ & $\SI{-3.25}{\percent}$ & $\SI{4.07}{\percent}$ & $\SI{-1.68}{\percent}$\\
    $B_{3,c}$ & \underline{$\SI{18.19}{\percent}$} & $\SI{-3.74}{\percent}$ & \underline{$\SI{4.06}{\percent}$} & $\SI{-1.68}{\percent}$\\
    $B_{4,c}$ & $\SI{18.57}{\percent}$ & $\SI{-5.31}{\percent}$ & $\SI{4.25}{\percent}$ & $\SI{-1.87}{\percent}$\\\midrule
    $B_{1,c}'$ & $\SI{19.33}{\percent}$ & $\SI{-1.63}{\percent}$ & $\SI{4.78}{\percent}$ & $\SI{-0.46}{\percent}$ \\
    $B_{2,c}'$ & \underline{$\SI{17.52}{\percent}$} & $\SI{-0.74}{\percent}$ & \underline{$\SI{4.04}{\percent}$} & $\SI{-0.03}{\percent}$\\
    $B_{3,c}'$ & $\SI{17.65}{\percent}$ & $\SI{-0.54}{\percent}$ & $\SI{4.07}{\percent}$ & $\SI{+0.01}{\percent}$\\
    $B_{4,c}'$ & $\SI{17.82}{\percent}$ & $\SI{-0.75}{\percent}$ & $\SI{4.26}{\percent}$ & $\SI{+0.01}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Error rates of the optimized recognizer systems. The systems
             $B_{i,c}'$ were trained for another $1000$ epochs with a learning
             rate of $\eta=0.05$. The value of the column \enquote{change} of
             the systems $B_{i,c}'$ is relative to $B_{i,c}$.}
    \label{table:complex-recognizer-systems-evaluation}
\end{table}
\section{Conclusion}
Four baseline recognition systems were adjusted in many experiments and their
recognition capabilities were compared in order to build a recognition system
that can recognize \totalClassesAnalyzed{} mathematical symbols with low error
rates as well as to evaluate which preprocessing steps and features help to
improve the recognition rate.

All recognition systems were trained and evaluated with
$\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
symbols. These recordings were collected by two crowdsourcing projects
(\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
\href{http://write-math.com}{write-math.com}) and created with various devices.
While some recordings were created with standard touch devices such as tablets
and smartphones, others were created with the mouse.

\Glspl{MLP} were used for the classification task. Four baseline systems with
different numbers of hidden layers were used, as the number of hidden layers
influences the capabilities and problems of \glspl{MLP}.

All baseline systems used the same preprocessing queue. The recordings were
scaled to fit into a unit square, shifted to $(0,0)$, and resampled with linear
interpolation so that every stroke had exactly 20~points which are spread
equidistantly in time. The 80~$(x,y)$ coordinates of the first 4~strokes were
used to get exactly $160$ input features for every recording. The baseline
system $B_2$ has a TOP-3 error of $\SI{5.75}{\percent}$.

Adding two slightly rotated variants for each recording and hence tripling the
training set made the systems $B_3$ and $B_4$ perform much worse, but improved
the performance of the smaller systems.

The global features re-curvature, ink, stroke count and aspect ratio improved
the systems $B_1$--$B_3$, whereas the stroke center point feature made $B_2$
perform worse.

Denoising auto-encoders were evaluated as one way to use pretraining, but they
increased the error rate notably. However, supervised layer-wise pretraining
improved the performance decidedly.

The stroke connect algorithm was added to the preprocessing steps of the
baseline system as well as the re-curvature feature, the ink feature, the number
of strokes and the aspect ratio. The training setup of the baseline system was
changed to supervised layer-wise pretraining and the resulting model was trained
with a lower learning rate again. This optimized recognizer $B_{2,c}'$ had a TOP-3
error of $\SI{4.04}{\percent}$. This means that the TOP-3 error dropped by over
$\SI{1.7}{\percent}$ in comparison to the baseline system $B_2$.

A TOP-3 error of $\SI{4.04}{\percent}$ makes the system usable for symbol lookup.
It could also be used as a starting point for the development of a
multiple-symbol classifier.

The aim of this work was to develop a symbol recognition system which is easy
to use, fast and has high recognition rates, as well as to evaluate ideas for
single-symbol classifiers. Some of those goals were reached. The recognition
system $B_{2,c}'$ evaluates new recordings in a fraction of a second and has
acceptable recognition rates. Many algorithms were evaluated.

However, there are still many other algorithms which could be evaluated and, at
the time of this work, the best classifier $B_{2,c}'$ is only available
through the Python package \texttt{hwrt}. It is planned to make a web version
of that classifier available online.

\bibliographystyle{IEEEtranSA}
\bibliography{write-math-ba-paper}
\end{document}