- \documentclass[9pt,technote]{IEEEtran}
- \usepackage{amssymb, amsmath} % needed for math
- \usepackage{hyperref} % links im text
- \usepackage{parskip}
- \usepackage{csquotes}
- \usepackage{braket}
- \usepackage[noadjust]{cite}
- \usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
- \usepackage[binary-units]{siunitx}
- \sisetup{per-mode=fraction,binary-units=true}
- \DeclareSIUnit\pixel{px}
- \usepackage{glossaries}
- \loadglsentries[main]{glossary}
- \makeglossaries
- \title{On-line Recognition of Handwritten Mathematical Symbols}
- \author{Martin Thoma}
- \hypersetup{
- pdfauthor = {Martin Thoma},
- pdfkeywords = {Mathematics,Symbols,recognition},
- pdftitle = {On-line Recognition of Handwritten Mathematical Symbols}
- }
- \include{variables}
- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- % Begin document %
- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- \begin{document}
- \maketitle
- \begin{abstract}
- Writing mathematical formulas with \LaTeX{} is easy as soon as one is used to
- commands like \verb+\alpha+ and \verb+\propto+. However, for people who have
- never used \LaTeX{} or who do not know the English name of the command, it can
- be difficult to find the right command. Hence the automatic recognition of
- handwritten mathematical symbols is desirable. This paper presents a system
- which uses the pen trajectory to classify handwritten symbols. Five
- preprocessing steps, one data multiplication algorithm, five features and five
- variants of multilayer perceptron training were evaluated using $\num{166898}$
- recordings which were collected with two crowdsourcing projects. The evaluation
- results of these 21~experiments were used to create an optimized recognizer
- which has a TOP1 error of less than $\SI{17.5}{\percent}$ and a TOP3 error of
- $\SI{4.0}{\percent}$. This is an improvement of $\SI{18.5}{\percent}$ for the
- TOP1 error and $\SI{29.7}{\percent}$ for the TOP3 error compared to the
- baseline system.
- \end{abstract}
- \section{Introduction}
- On-line recognition makes use of the pen trajectory. This means the data is
- given as groups of sequences of tuples $(x, y, t) \in \mathbb{R}^3$, where
- each group represents a stroke, $(x, y)$ is the position of the pen on a canvas
- and $t$ is the time. One handwritten symbol in the described format is called
- a \textit{recording}. Recordings can be classified by making use of
- this data. One classification approach assigns a probability to each class
- given the data. The classifier can be evaluated by using recordings which
- were classified by humans and were not used to train the classifier. The
- set of those recordings is called the \textit{test set}. The
- TOP-$n$ error is then defined as the fraction of the symbols for which the
- correct class was not among the $n$ classes with the highest probability.
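This definition translates directly into code. The following Python sketch, with made-up class names and probabilities purely for illustration, computes the TOP-$n$ error:

```python
def top_n_error(predictions, labels, n):
    """Fraction of recordings whose correct class is not among the
    n classes with the highest predicted probability."""
    errors = 0
    for probs, label in zip(predictions, labels):
        # classes sorted by descending probability, keep the best n
        top = sorted(probs, key=probs.get, reverse=True)[:n]
        if label not in top:
            errors += 1
    return errors / len(labels)

# Toy example: three recordings, three classes (hypothetical values)
preds = [{"alpha": 0.7, "propto": 0.2, "sum": 0.1},
         {"alpha": 0.1, "propto": 0.6, "sum": 0.3},
         {"alpha": 0.5, "propto": 0.4, "sum": 0.1}]
labels = ["alpha", "sum", "alpha"]
```

Here the second recording counts as a TOP-1 error (the highest-probability class is `propto`, not `sum`), but not as a TOP-2 error.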
- Various systems for mathematical symbol recognition with on-line data have been
- described so far~\cite{Kosmala98,Mouchere2013}, but most of them have neither
- published their source code nor their data which makes it impossible to re-run
- experiments to compare different systems. This is unfortunate as the choice of
- symbols is crucial for the TOP-$n$ error. For example, the symbols $o$, $O$,
- $\circ$ and $0$ are very similar and systems which know all those classes will
- certainly have a higher TOP-$n$ error than systems which only accept one of
- them.
- Daniel Kirsch describes in~\cite{Kirsch} a system which uses time warping to
- classify on-line handwritten symbols and claims to achieve a TOP3 error of
- less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols. He also
- published his data, which was collected by a crowd-sourcing approach via
- \url{http://detexify.kirelabs.org}, on
- \url{https://github.com/kirel/detexify-data}. Those recordings as well as
- some recordings which were collected by a similar approach via
- \url{http://write-math.com} were used to train and evaluate different
- classifiers. A complete description of all involved software, data,
- presentations and experiments is listed in~\cite{Thoma:2014}.
- \section{Steps in Handwriting Recognition}
- The following steps are used in all classifiers which are described in the
- following:
- \begin{enumerate}
- \item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
- errors and people make mistakes while using devices. To tackle
- these problems there are preprocessing algorithms to clean the data.
- The preprocessing algorithms can also remove unnecessary variations of
- the data that do not help in classification but hide what is important.
- Having slightly different sizes of the same symbol is an example of such a
- variation. Nine preprocessing algorithms that clean or normalize
- recordings are explained in
- \cref{sec:preprocessing}.
- \item \textbf{Data multiplication}: Learning algorithms need lots of data
- to learn internal parameters. If there is not enough data available,
- domain knowledge can be considered to create new artificial data from
- the original data. In the domain of on-line handwriting recognition
- data can be multiplied by adding rotated variants.
- \item \textbf{Segmentation}: The task of formula recognition can essentially
- be reduced to the task of symbol recognition combined with symbol
- placement. Before symbol recognition can be done, the formula has
- to be segmented. As this paper is only about single-symbol
- recognition, this step will not be further discussed.
- \item \textbf{Feature computation}: A feature is high-level information
- derived from the raw data after preprocessing. Some systems like
- Detexify, which was presented in~\cite{Kirsch}, simply take the
- result of the preprocessing step, but many compute new features. This
- might have the advantage that less training data is needed since the
- developer can use knowledge about handwriting to compute highly
- discriminative features. Various features are explained in
- \cref{sec:features}.
- \item \textbf{Feature enhancement}: Applying PCA, LDA, or
- feature standardization might change the features in ways that could
- improve the performance of learning algorithms.
- \end{enumerate}
- After these steps, we are faced with a classification learning task which consists of
- two parts:
- \begin{enumerate}
- \item \textbf{Learning} parameters for a given classifier. This process is
- also called \textit{training}.
- \item \textbf{Classifying} new recordings, sometimes called
- \textit{evaluation}. This should not be confused with the evaluation
- of the classification performance which is done for multiple
- topologies, preprocessing queues, and features in \Cref{ch:Evaluation}.
- \end{enumerate}
- Two fundamentally different systems for classification of time series data were
- evaluated. One uses greedy time warping, which has a very simple, fast learning
- algorithm that only stores some of the seen training examples. The other one is
- based on neural networks, taking longer to train, but is much faster in
- recognition and also leads to better recognition results.
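As an illustration, a greedy warping distance between two point sequences could be sketched as follows. This is a simplification for illustration, not necessarily the exact variant used by Detexify:

```python
def point_dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def greedy_dtw(a, b):
    """Greedy approximation of a time warping distance between two
    point sequences: at each step, take the cheapest of the three
    possible alignment moves (advance a, advance b, or both)."""
    i, j, cost = 0, 0, point_dist(a[0], b[0])
    while i < len(a) - 1 or j < len(b) - 1:
        candidates = []
        if i < len(a) - 1:
            candidates.append((point_dist(a[i + 1], b[j]), i + 1, j))
        if j < len(b) - 1:
            candidates.append((point_dist(a[i], b[j + 1]), i, j + 1))
        if i < len(a) - 1 and j < len(b) - 1:
            candidates.append((point_dist(a[i + 1], b[j + 1]), i + 1, j + 1))
        c, i, j = min(candidates)  # greedy: cheapest move wins
        cost += c
    return cost
```

A nearest-neighbor classifier built on this distance predicts the label of the stored training example with the smallest warping distance to the query recording.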
- \section{Algorithms}
- \subsection{Preprocessing}\label{sec:preprocessing}
- Preprocessing in symbol recognition is done to improve the quality and
- expressive power of the data. It should make follow-up tasks like segmentation
- and feature extraction easier, more effective or faster. It does so by resolving
- errors in the input data, reducing duplicate information and removing irrelevant
- information.
- The preprocessing algorithms fall into two groups: normalization and noise
- reduction algorithms.
- The most important normalization algorithm in single-symbol recognition is
- \textit{scale-and-shift}. It scales the recording so that
- its bounding box fits into a unit square. As the aspect ratio of a recording
- is almost never 1:1, only one dimension will fit exactly in the unit square.
- Then there are multiple ways to shift the recording. For this paper, it was
- chosen to shift the bigger dimension to fit into the $[0,1] \times [0,1]$ unit
- square whereas the smaller dimension is centered in the $[-1,1] \times [-1,1]$
- square.
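A minimal sketch of this scale-and-shift step, assuming recordings are lists of strokes and strokes are lists of $(x, y)$ points:

```python
def scale_and_shift(strokes):
    """Scale a recording so its bounding box fits a unit square
    (sketch following the description above: the larger dimension is
    shifted to [0, 1], the smaller one is centered around 0)."""
    xs = [p[0] for s in strokes for p in s]
    ys = [p[1] for s in strokes for p in s]
    minx, maxx, miny, maxy = min(xs), max(xs), min(ys), max(ys)
    width, height = maxx - minx, maxy - miny
    factor = 1.0 / max(width, height, 1e-9)  # guard against dot-like input

    def tx(x):
        if width >= height:
            return (x - minx) * factor            # larger dimension -> [0, 1]
        return (x - (minx + maxx) / 2) * factor   # smaller dimension centered at 0

    def ty(y):
        if height > width:
            return (y - miny) * factor
        return (y - (miny + maxy) / 2) * factor

    return [[(tx(x), ty(y)) for (x, y) in s] for s in strokes]
```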
- Another normalization preprocessing algorithm is resampling. As the data points
- on the pen trajectory are generated asynchronously and with different
- time resolutions depending on the hardware and software used, it is desirable
- to resample the recordings to have points spread equally in time for every
- recording. This was done by linear interpolation of the $(x,t)$ and $(y,t)$
- sequences and taking a fixed number of equally spaced samples.
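This resampling could be sketched as follows, assuming each point of a stroke is an $(x, y, t)$ tuple:

```python
def resample(stroke, num_points):
    """Linearly interpolate the (x, t) and (y, t) sequences of one
    stroke and take num_points samples spaced equally in time."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    ts = [p[2] for p in stroke]

    def interp(t, values):
        # piecewise linear interpolation over the time axis
        if t <= ts[0]:
            return values[0]
        for a in range(1, len(ts)):
            if t <= ts[a]:
                span = ts[a] - ts[a - 1]
                w = (t - ts[a - 1]) / span if span else 0.0
                return values[a - 1] + w * (values[a] - values[a - 1])
        return values[-1]

    step = (ts[-1] - ts[0]) / (num_points - 1)
    return [(interp(ts[0] + i * step, xs),
             interp(ts[0] + i * step, ys),
             ts[0] + i * step)
            for i in range(num_points)]
```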
- \textit{Connect strokes} is a noise reduction algorithm. Sometimes the
- hardware erroneously detects that the user lifted the pen although they
- certainly did not. This can be detected by measuring the distance between the end of
- one stroke and the beginning of the next stroke. If this distance is below a
- threshold, then the strokes are connected.
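A sketch of the stroke connection step; the threshold value here is a made-up placeholder for unit-square coordinates, not the one used in the evaluated system:

```python
def connect_strokes(strokes, threshold=0.05):
    """Merge consecutive strokes whose gap (distance from the end of
    one stroke to the start of the next) is below the threshold."""
    if not strokes:
        return []
    result = [list(strokes[0])]
    for stroke in strokes[1:]:
        lx, ly = result[-1][-1]   # last point of the previous stroke
        fx, fy = stroke[0]        # first point of the current stroke
        gap = ((lx - fx) ** 2 + (ly - fy) ** 2) ** 0.5
        if gap < threshold:
            result[-1].extend(stroke)  # likely an accidental pen lift
        else:
            result.append(list(stroke))
    return result
```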
- Due to a limited resolution of the recording device and due to erratic
- handwriting, the pen trajectory might not be smooth. One way to smooth it is
- to replace each point by a weighted average of its own
- coordinates and its neighbors' coordinates. Another way to do smoothing
- would be to reduce the number of points with the Douglas-Peucker algorithm to
- the most relevant ones and then interpolate those points. The Douglas-Peucker
- stroke simplification algorithm is usually used in cartography to simplify the
- shape of roads. The Douglas-Peucker algorithm works recursively to find a
- subset of control points of a stroke that is simpler and still similar to the
- original shape. The algorithm adds the first and the last point $p_1$ and $p_n$
- of a stroke to the simplified set of points $S$. Then it searches the control
- point $p_i$ in between that has maximum distance from the \gls{line} $p_1 p_n$.
- If this distance is above a threshold $\varepsilon$, the point $p_i$ is added
- to $S$. Then the algorithm gets applied to $p_1 p_i$ and $p_i p_n$ recursively.
- Pseudocode of this algorithm is on \cpageref{alg:douglas-peucker}. It is
- described as \enquote{Algorithm 1} in~\cite{Visvalingam1990} with a different
- notation.
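In Python, the recursion described above might look like this (a generic sketch, not the exact pseudocode referenced):

```python
def _point_line_dist(p, a, b):
    """Distance of point p from the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = (dx * dx + dy * dy) ** 0.5
    if norm == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dx * (ay - py) - dy * (ax - px)) / norm

def douglas_peucker(stroke, epsilon):
    """Recursive Douglas-Peucker simplification: keep the first and
    last point; keep the farthest interior point if its distance from
    the line exceeds epsilon, then recurse on both halves."""
    dmax, index = 0.0, 0
    for i in range(1, len(stroke) - 1):
        d = _point_line_dist(stroke[i], stroke[0], stroke[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > epsilon:
        left = douglas_peucker(stroke[:index + 1], epsilon)
        right = douglas_peucker(stroke[index:], epsilon)
        return left[:-1] + right  # drop the duplicated split point
    return [stroke[0], stroke[-1]]
```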
- \subsection{Features}\label{sec:features}
- Features can be global, that means calculated for the complete recording or
- complete strokes. Other features are calculated for single points on the
- pen trajectory and are called \textit{local}.
- Global features are the \textit{number of strokes} in a recording, the
- \textit{aspect ratio} of a recording's bounding box or the
- \textit{ink} used for a recording. The ink feature is calculated by
- measuring the combined length of all strokes. The re-curvature, which was
- introduced in~\cite{Huang06}, is defined as
- \[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
- and is a stroke-global feature.
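These global features can be sketched as follows; the dictionary structure is an illustration, not the system's actual feature encoding:

```python
def stroke_length(stroke):
    """Polyline length of one stroke."""
    return sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
               for (x1, y1), (x2, y2) in zip(stroke, stroke[1:]))

def global_features(strokes):
    """Stroke count, aspect ratio of the bounding box, ink, and
    per-stroke re-curvature (height / length), as described above."""
    xs = [p[0] for s in strokes for p in s]
    ys = [p[1] for s in strokes for p in s]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return {
        "stroke_count": len(strokes),
        "aspect_ratio": width / height if height else float("inf"),
        "ink": sum(stroke_length(s) for s in strokes),
        "re_curvature": [
            (max(p[1] for p in s) - min(p[1] for p in s)) / stroke_length(s)
            for s in strokes if stroke_length(s) > 0
        ],
    }
```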
- The most important local feature is the coordinate of the point itself.
- Speed, curvature and a local small-resolution bitmap around the point, which
- was introduced by Manke et al. in~\cite{Manke94}, are other local features.
- \subsection{Multilayer Perceptrons}\label{sec:mlp-training}
- \Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
- different numbers of hidden layers, the number of neurons per layer and the
- activation functions can be varied. The learning algorithm is parameterized by
- the learning rate $\eta$, the momentum $\alpha$ and the number of epochs. The
- training of \glspl{MLP} can be executed in various ways, for example
- with supervised layer-wise pretraining: if an \gls{MLP} with three hidden
- layers and the topology $160:500:500:500:369$ should be trained, at first an
- \gls{MLP} with one hidden layer ($160:500:369$) is trained. Then the output
- layer is discarded, a new hidden layer and a new output layer are added, and
- the network is trained again, yielding a $160:500:500:369$ \gls{MLP}. The
- output layer is discarded once more, a new hidden layer and a new output
- layer are added, and the training is executed again.
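The sequence of topologies visited during this pretraining scheme can be sketched as follows; the actual weight training at each stage is left to whichever neural network library is used:

```python
def pretraining_stages(topology):
    """Topologies trained during supervised layer-wise pretraining:
    start with one hidden layer, then repeatedly discard the output
    layer and add one more hidden layer."""
    inputs, *hidden, outputs = topology
    return [[inputs] + hidden[:k] + [outputs]
            for k in range(1, len(hidden) + 1)]
```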
- \section{Evaluation}\label{ch:Evaluation}
- In order to evaluate the effect of different preprocessing algorithms, features
- and adjustments in the \gls{MLP} training and topology, the following baseline
- system was used:
- Scale the recording to fit into a unit square while keeping the aspect ratio,
- shift it into $[-1,1] \times [-1,1]$ as described in \cref{sec:preprocessing},
- resample it with linear interpolation to get 20~points per stroke, spaced
- evenly in time. Take the first 4~strokes with 20~points per stroke and
- 2~coordinates per point as features, resulting in 160~features which is equal
- to the number of input neurons. If a recording has fewer than 4~strokes, the
- remaining features were filled with zeroes.
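Assembling this 160-dimensional feature vector can be sketched as follows, assuming every stroke has already been resampled to 20 points:

```python
def baseline_features(strokes, num_strokes=4, points_per_stroke=20):
    """Flatten the first 4 strokes (20 points each, 2 coordinates per
    point) into the 160-dimensional input vector of the baseline
    system; missing strokes are zero-padded."""
    features = []
    for i in range(num_strokes):
        if i < len(strokes):
            for (x, y) in strokes[i]:
                features.extend([x, y])
        else:
            # recording has fewer strokes: fill with zeroes
            features.extend([0.0] * (2 * points_per_stroke))
    return features
```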
- All experiments were evaluated with four baseline systems $B_i$, $i \in \Set{1,
- 2, 3, 4}$, where $i$ is the number of hidden layers as different topologies
- could have a severe influence on the effect of new features or preprocessing
- steps. Each hidden layer in all evaluated systems has $500$ neurons.
- Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
- of $\alpha = 0.1$. The activation function of every neuron in a hidden layer
- is the sigmoid function; the neurons in the output layer use the softmax
- function, so that the outputs can be interpreted as a probability
- distribution over the classes.
- \section{Conclusion}
- The aim of this bachelor's thesis was to build a recognition system that
- can recognize many mathematical symbols with low error rates as well as to
- evaluate which preprocessing steps and features help to improve the recognition
- rate.
- All recognition systems were trained and evaluated with
- $\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
- symbols. These recordings were collected by two crowdsourcing projects
- (\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
- \href{http://write-math.com}{write-math.com}) and created with various devices. While
- some recordings were created with standard touch devices such as tablets and
- smartphones, others were created with the mouse.
- \Glspl{MLP} were used for the classification task. Four baseline systems with
- different numbers of hidden layers were used, as the number of hidden layers
- influences the capabilities and problems of \glspl{MLP}. Furthermore, an error
- measure MER was defined, which takes the top three \glspl{hypothesis} of the classifier,
- merges symbols such as \verb+\sum+ ($\sum$) and \verb+\Sigma+ ($\Sigma$) to
- equivalence classes, and then calculates the error.
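The MER measure could be sketched as follows; the equivalence table shown is a hypothetical fragment, not the complete merge table from the thesis:

```python
# Hypothetical fragment of the symbol equivalence table
EQUIVALENT = {r"\Sigma": r"\sum", r"\Pi": r"\prod"}

def canon(symbol):
    """Map a symbol to the representative of its equivalence class."""
    return EQUIVALENT.get(symbol, symbol)

def mer_error(top3_hypotheses, labels):
    """MER error: a recording counts as correct if any of the top
    three hypotheses falls into the same equivalence class as the
    correct label."""
    errors = sum(
        1 for hyps, label in zip(top3_hypotheses, labels)
        if canon(label) not in {canon(h) for h in hyps}
    )
    return errors / len(labels)
```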
- All baseline systems used the same preprocessing queue. The recordings were
- scaled to fit into a unit square, shifted to $(0,0)$, resampled with linear
- interpolation so that every stroke had exactly 20~points which are spread
- equidistant in time. The 80~($x,y$) coordinates of the first 4~strokes were used
- to get exactly $160$ input features for every recording. The baseline system
- $B_2$ has a MER error of $\SI{5.67}{\percent}$.
- Three variations of the scale and shift algorithm, wild point filtering, stroke
- connect, weighted average smoothing, and Douglas-Peucker smoothing were
- evaluated. The evaluation showed that the scale and shift algorithm is extremely
- important and the connect strokes algorithm improves the classification. All
- other preprocessing algorithms either diminished the classification performance
- or had less influence on it than the random initialization of the \glspl{MLP}
- weights.
- Adding two slightly rotated variants for each recording and hence tripling the
- training set made the systems $B_3$ and $B_4$ perform much worse, but improved
- the performance of the smaller systems.
- The global features re-curvature, ink, stroke count and aspect ratio improved the
- systems $B_1$--$B_3$, whereas the stroke center point feature made $B_2$ perform
- worse.
- The learning rate and the momentum were evaluated. A learning rate of $\eta=0.1$
- and a momentum of $\alpha=0.9$ gave the best results. Newbob training led to
- much worse recognition rates. Denoising auto-encoders were evaluated as one way
- to use pretraining, but this increased the error rate notably. However,
- supervised layer-wise pretraining improved the performance decidedly.
- The stroke connect algorithm was added to the preprocessing steps of the
- baseline system as well as the re-curvature feature, the ink feature, the number
- of strokes and the aspect ratio. The training setup of the baseline system was
- changed to supervised layer-wise pretraining and the resulting model was trained
- with a lower learning rate again. This optimized recognizer $B_{2,c}'$ had a MER
- error of $\SI{3.96}{\percent}$. This means that the MER error dropped by over
- $\SI{30}{\percent}$ in comparison to the baseline system $B_2$.
- A MER error of $\SI{3.96}{\percent}$ makes the system usable for symbol lookup.
- It could also be used as a starting point for the development of a
- multiple-symbol classifier.
- The aim of this bachelor's thesis was to develop a symbol recognition system
- which is easy to use, fast and has high recognition rates as well as evaluating
- ideas for single symbol classifiers. Some of those goals were reached. The
- recognition system $B_{2,c}'$ evaluates new recordings in a fraction of a second
- and has acceptable recognition rates. Many algorithmic variations were evaluated.
- However, there are still many more algorithms which could be evaluated and, at
- the time of this work, the best classifier $B_{2,c}'$ is not publicly available.
- \bibliographystyle{IEEEtranSA}
- \bibliography{write-math-ba-paper}
- \end{document}