\documentclass[9pt,technote]{IEEEtran}
\usepackage{amssymb, amsmath} % needed for math
\usepackage{hyperref} % links in the text
\usepackage{parskip}
\usepackage{csquotes}
\usepackage{braket}
\usepackage[noadjust]{cite}
\usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
\usepackage[binary-units]{siunitx}
\sisetup{per-mode=fraction,binary-units=true}
\DeclareSIUnit\pixel{px}
\usepackage{glossaries}
\loadglsentries[main]{glossary}
\makeglossaries
\title{On-line Recognition of Handwritten Mathematical Symbols}
\author{Martin Thoma}
\hypersetup{
pdfauthor = {Martin Thoma},
pdfkeywords = {Mathematics,Symbols,recognition},
pdftitle = {On-line Recognition of Handwritten Mathematical Symbols}
}
\include{variables}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Begin document                                                    %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\begin{abstract}
Writing mathematical formulas with \LaTeX{} is easy as soon as one is used to
commands like \verb+\alpha+ and \verb+\propto+. However, for people who have
never used \LaTeX{} or who do not know the English name of a command, it can
be difficult to find the right one. Hence the automatic recognition of
handwritten mathematical symbols is desirable. This paper presents a system
which uses the pen trajectory to classify handwritten symbols. Five
preprocessing steps, one data multiplication algorithm, five features and five
variants of multilayer perceptron training were evaluated using $\num{166898}$
recordings which were collected with two crowdsourcing projects. The evaluation
results of these 21~experiments were used to create an optimized recognizer
which has a TOP-1 error of less than $\SI{17.5}{\percent}$ and a TOP-3 error of
$\SI{4.0}{\percent}$. This is an improvement of $\SI{18.5}{\percent}$ for the
TOP-1 error and $\SI{29.7}{\percent}$ for the TOP-3 error compared to the
baseline system.
\end{abstract}
\section{Introduction}
On-line recognition makes use of the pen trajectory. This means the data is
given as groups of sequences of tuples $(x, y, t) \in \mathbb{R}^3$, where
each sequence represents a stroke, $(x, y)$ is the position of the pen on a
canvas and $t$ is the time. One handwritten symbol in the described format is
called a \textit{recording}. Recordings can be classified by making use of
this data. One classification approach assigns a probability to each class
given the data. The classifier can be evaluated with recordings which
were classified by humans and were not used to train the classifier. The
set of those recordings is called the \textit{test set}. The
TOP-$n$ error is then defined as the fraction of the symbols where the correct
class was not among the $n$ classes with the highest probability.
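
Stated as a formula, with $T$ the test set of recordings $x$ with correct
classes $y$, and $\operatorname{top}_n(x)$ denoting the $n$ classes to which
the classifier assigns the highest probability for $x$ (this notation is
introduced here only to make the definition precise):
\[E_n := \frac{\left|\Set{(x, y) \in T : y \notin \operatorname{top}_n(x)}\right|}{|T|}\]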

Various systems for mathematical symbol recognition with on-line data have been
described so far~\cite{Kosmala98,Mouchere2013}, but most of them have neither
published their source code nor their data, which makes it impossible to re-run
experiments and compare different systems. This is unfortunate as the choice of
symbols is crucial for the TOP-$n$ error. For example, the symbols $o$, $O$,
$\circ$ and $0$ are very similar, and systems which know all of those classes
will certainly have a higher TOP-$n$ error than systems which only accept one
of them.

Daniel Kirsch describes in~\cite{Kirsch} a system which uses time warping to
classify on-line handwritten symbols and claims to achieve a TOP-3 error of
less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols. He also
published his data, which was collected by a crowdsourcing approach via
\url{http://detexify.kirelabs.org}, on
\url{https://github.com/kirel/detexify-data}. Those recordings, as well as
some recordings which were collected by a similar approach via
\url{http://write-math.com}, were used to train and evaluate different
classifiers. A complete description of all involved software, data,
presentations and experiments is given in~\cite{Thoma:2014}.
\section{Steps in Handwriting Recognition}
The following steps are used in all classifiers described in this paper:
\begin{enumerate}
\item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
errors and people make mistakes while using devices. To tackle
these problems, preprocessing algorithms are used to clean the data.
Preprocessing algorithms can also remove unnecessary variations of
the data that do not help with classification but hide what is important.
Slightly different sizes of the same symbol are an example of such a
variation. Nine preprocessing algorithms that clean or normalize
recordings are explained in
\cref{sec:preprocessing}.
\item \textbf{Data multiplication}: Learning algorithms need lots of data
to learn internal parameters. If there is not enough data available,
domain knowledge can be used to create new artificial data from
the original data. In the domain of on-line handwriting recognition,
data can be multiplied by adding rotated variants.
\item \textbf{Segmentation}: The task of formula recognition can ultimately
be reduced to the task of symbol recognition combined with symbol
placement. Before symbol recognition can be done, the formula has
to be segmented. As this paper is only about single-symbol
recognition, this step will not be discussed further.
\item \textbf{Feature computation}: A feature is high-level information
derived from the raw data after preprocessing. Some systems, like
Detexify, which was presented in~\cite{Kirsch}, simply take the
result of the preprocessing step, but many compute new features. This
can have the advantage that less training data is needed, since the
developer can use knowledge about handwriting to compute highly
discriminative features. Various features are explained in
\cref{sec:features}.
\item \textbf{Feature enhancement}: Applying PCA, LDA, or
feature standardization might change the features in ways that
improve the performance of learning algorithms.
\end{enumerate}

After these steps, we are faced with a classification learning task which
consists of two parts:
\begin{enumerate}
\item \textbf{Learning} parameters for a given classifier. This process is
also called \textit{training}.
\item \textbf{Classifying} new recordings, sometimes called
\textit{evaluation}. This should not be confused with the evaluation
of the classification performance, which is done for multiple
topologies, preprocessing queues, and features in \Cref{ch:Evaluation}.
\end{enumerate}

Two fundamentally different systems for the classification of time series data
were evaluated. One uses greedy time warping, which has a very simple, fast
learning algorithm that only stores some of the seen training examples. The
other one is based on neural networks, which take longer to train, but are much
faster in recognition and also lead to better recognition results.
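
To make the contrast concrete, the following Python sketch shows a
distance-based classifier of the first kind. The greedy alignment is only an
illustrative stand-in for the time warping of~\cite{Kirsch}, not the exact
algorithm used there, and all names are chosen for this sketch:
\begin{verbatim}
def greedy_warping_distance(a, b):
    # Greedily align two strokes a, b (lists of (x, y)
    # points): at each step advance in a, in b, or in
    # both, whichever pairing is cheapest.
    def d(p, q):
        return ((p[0] - q[0]) ** 2
                + (p[1] - q[1]) ** 2) ** 0.5
    i = j = 0
    cost = d(a[0], b[0])
    while i < len(a) - 1 or j < len(b) - 1:
        steps = []
        if i < len(a) - 1:
            steps.append((d(a[i + 1], b[j]), i + 1, j))
        if j < len(b) - 1:
            steps.append((d(a[i], b[j + 1]), i, j + 1))
        if i < len(a) - 1 and j < len(b) - 1:
            steps.append((d(a[i + 1], b[j + 1]),
                          i + 1, j + 1))
        c, i, j = min(steps)
        cost += c
    return cost

def classify(unknown, examples):
    # 1-nearest neighbor over stored (stroke, label)
    # pairs; training just stores examples.
    return min(examples,
               key=lambda e: greedy_warping_distance(
                   unknown, e[0]))[1]
\end{verbatim}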
\section{Algorithms}
\subsection{Preprocessing}\label{sec:preprocessing}
Preprocessing in symbol recognition is done to improve the quality and
expressive power of the data. It should make follow-up tasks like segmentation
and feature extraction easier, more effective or faster. It does so by resolving
errors in the input data, reducing duplicate information and removing irrelevant
information.

The preprocessing algorithms fall into two groups: normalization and noise
reduction algorithms.

The most important normalization algorithm in single-symbol recognition is
\textit{scale-and-shift}. It scales the recording so that
its bounding box fits into a unit square. As the aspect ratio of a recording
is almost never 1:1, only one dimension will fit the unit square exactly.
There are then multiple ways to shift the recording. For this paper, the
bigger dimension was shifted to fit into the $[0,1] \times [0,1]$ unit
square whereas the smaller dimension was centered in the $[-1,1] \times [-1,1]$
square.
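
A minimal sketch of this normalization, assuming a recording is given as a
list of strokes, each a list of $(x, y, t)$ tuples (all names are
illustrative):
\begin{verbatim}
def scale_and_shift(recording):
    # Scale the bounding box to unit size; shift the
    # bigger dimension to [0, 1] and center the
    # smaller dimension around 0.
    xs = [p[0] for s in recording for p in s]
    ys = [p[1] for s in recording for p in s]
    x_min, y_min = min(xs), min(ys)
    width = max(xs) - x_min
    height = max(ys) - y_min
    factor = 1.0 / max(width, height, 10 ** -9)
    x_off = (0.0 if width >= height
             else -(width * factor) / 2)
    y_off = (0.0 if height > width
             else -(height * factor) / 2)
    return [[((x - x_min) * factor + x_off,
              (y - y_min) * factor + y_off, t)
             for (x, y, t) in s]
            for s in recording]
\end{verbatim}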

Another normalization preprocessing algorithm is resampling. As the data points
on the pen trajectory are generated asynchronously and with different
time resolutions depending on the hardware and software used, it is desirable
to resample the recordings so that the points are spread equally in time for
every recording. This was done by linear interpolation of the $(x,t)$ and
$(y,t)$ sequences and sampling a fixed number of equally spaced points.
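
A sketch of this resampling for a single stroke, assuming the same point
format as above (20~points per stroke matches the baseline system of
\cref{ch:Evaluation}):
\begin{verbatim}
from bisect import bisect_right

def resample_stroke(stroke, num_points=20):
    # Linearly interpolate the (x, t) and (y, t)
    # sequences and sample num_points points that
    # are equally spaced in time.
    ts = [p[2] for p in stroke]
    t0, t1 = ts[0], ts[-1]
    if t1 == t0:  # degenerate stroke (single instant)
        return [stroke[0]] * num_points
    out = []
    for k in range(num_points):
        t = t0 + (t1 - t0) * k / (num_points - 1)
        i = min(bisect_right(ts, t), len(ts) - 1)
        if ts[i] == ts[i - 1]:  # duplicate timestamps
            x, y = stroke[i][0], stroke[i][1]
        else:
            w = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
            x = (1 - w) * stroke[i - 1][0] \
                + w * stroke[i][0]
            y = (1 - w) * stroke[i - 1][1] \
                + w * stroke[i][1]
        out.append((x, y, t))
    return out
\end{verbatim}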

\textit{Connect strokes} is a noise reduction algorithm. Sometimes the
hardware erroneously detects that the user lifted the pen. Such splits can be
detected by measuring the distance between the end of
one stroke and the beginning of the next stroke. If this distance is below a
threshold, then the strokes are connected.
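
A sketch of this step; the threshold of $0.05$ for recordings scaled to the
unit square is an illustrative value, not the one used in the evaluated
systems:
\begin{verbatim}
def connect_strokes(recording, threshold=0.05):
    # Merge consecutive strokes whose gap (end of one
    # stroke to start of the next) is below threshold.
    connected = [recording[0]]
    for stroke in recording[1:]:
        x1, y1 = connected[-1][-1][:2]
        x2, y2 = stroke[0][:2]
        gap = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        if gap < threshold:
            connected[-1] = connected[-1] + stroke
        else:
            connected.append(stroke)
    return connected
\end{verbatim}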

Due to the limited resolution of the recording device and due to erratic
handwriting, the pen trajectory might not be smooth. One way to smooth it is
to replace each point by the weighted average of its own and its neighbors'
coordinates. Another way to do smoothing
is to reduce the number of points with the Douglas-Peucker algorithm to
the most relevant ones and then interpolate between those points. The
Douglas-Peucker stroke simplification algorithm is usually used in cartography
to simplify the shape of roads. It works recursively to find a
subset of the control points of a stroke that is simpler and still similar to
the original shape. The algorithm adds the first and the last point $p_1$ and
$p_n$ of a stroke to the simplified set of points $S$. Then it searches for the
control point $p_i$ in between that has the maximum distance from the
\gls{line} $p_1 p_n$.
If this distance is above a threshold $\varepsilon$, the point $p_i$ is added
to $S$. The algorithm is then applied to $p_1 p_i$ and $p_i p_n$ recursively.
Pseudocode of this algorithm is given on \cpageref{alg:douglas-peucker}. It is
described as \enquote{Algorithm 1} in~\cite{Visvalingam1990} with a different
notation.
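
A compact recursive sketch of the simplification just described; points are
$(x, y, \dots)$ tuples and the distance helper is chosen for this
illustration:
\begin{verbatim}
def douglas_peucker(stroke, epsilon):
    # Keep the endpoints and, recursively, every point
    # whose distance to the line p1..pn exceeds epsilon.
    def dist(p, a, b):
        # Distance from p to the line through a and b.
        num = abs((b[0] - a[0]) * (a[1] - p[1])
                  - (a[0] - p[0]) * (b[1] - a[1]))
        den = ((b[0] - a[0]) ** 2
               + (b[1] - a[1]) ** 2) ** 0.5
        if den == 0:  # a == b: plain point distance
            return ((p[0] - a[0]) ** 2
                    + (p[1] - a[1]) ** 2) ** 0.5
        return num / den
    if len(stroke) < 3:
        return list(stroke)
    ds = [dist(p, stroke[0], stroke[-1])
          for p in stroke[1:-1]]
    i = ds.index(max(ds)) + 1  # farthest inner point
    if ds[i - 1] <= epsilon:   # drop all inner points
        return [stroke[0], stroke[-1]]
    left = douglas_peucker(stroke[:i + 1], epsilon)
    right = douglas_peucker(stroke[i:], epsilon)
    return left[:-1] + right   # stroke[i] only once
\end{verbatim}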
\subsection{Features}\label{sec:features}
Features can be \textit{global}, that is, calculated for the complete
recording or for complete strokes. Other features are calculated for single
points on the pen trajectory and are called \textit{local}.

Global features are the \textit{number of strokes} in a recording, the
\textit{aspect ratio} of a recording's bounding box or the
\textit{ink} used for a recording. The ink feature is calculated by
measuring the length of all strokes combined. The re-curvature, which was
introduced in~\cite{Huang06}, is defined as
\[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
and is a stroke-global feature.

The most important local feature is the coordinate of the point itself.
Speed, curvature and a local small-resolution bitmap around the point, which
was introduced by Manke et al. in~\cite{Manke94}, are other local features.
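
The global features can be computed directly from a recording, as the
following sketch shows (the names and the guards against degenerate
recordings are illustrative):
\begin{verbatim}
def global_features(recording):
    # Global features of a recording given as a list
    # of strokes of (x, y, t) tuples.
    def length(s):
        return sum(((s[i][0] - s[i - 1][0]) ** 2
                    + (s[i][1] - s[i - 1][1]) ** 2)
                   ** 0.5
                   for i in range(1, len(s)))
    def height(s):
        return (max(p[1] for p in s)
                - min(p[1] for p in s))
    xs = [p[0] for s in recording for p in s]
    ys = [p[1] for s in recording for p in s]
    width = max(xs) - min(xs)
    h = max(ys) - min(ys)
    return {
        "stroke count": len(recording),
        "aspect ratio": width / max(h, 10 ** -9),
        "ink": sum(length(s) for s in recording),
        # re-curvature per stroke (see [Huang06]):
        "re-curvature": [height(s)
                         / max(length(s), 10 ** -9)
                         for s in recording],
    }
\end{verbatim}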
\subsection{Multilayer Perceptrons}\label{sec:mlp-training}
\Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
different numbers of hidden layers; the number of neurons per layer and the
activation functions can also be varied. The learning algorithm is
parameterized by the learning rate $\eta$, the momentum $\alpha$ and the number
of epochs. The training of \glspl{MLP} can be executed in various different
ways, for example with layer-wise supervised pretraining. This means that if an
\gls{MLP} with three hidden layers and the topology $160:500:500:500:369$
should be trained, at first an \gls{MLP} with one hidden layer ($160:500:369$)
is trained. Then the output layer is discarded, a new hidden layer and a new
output layer are added, and the network is trained again, giving a
$160:500:500:369$ \gls{MLP}. The output layer is discarded once more, a new
hidden layer and a new output layer are added, and the training is executed
again.
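
The following schematic sketch illustrates this procedure; \verb+train+ is an
assumed helper (not part of any specific library) that trains an \gls{MLP} of
the given topology, reuses the passed weights for the layers that already
exist and returns one weight matrix per layer:
\begin{verbatim}
def layerwise_pretraining(sizes, train, data):
    # sizes = [160, 500, 500, 500, 369] for the
    # example topology.
    n_in, n_out = sizes[0], sizes[-1]
    hidden = sizes[1:-1]
    weights = []  # trained hidden layers kept so far
    for k in range(1, len(hidden) + 1):
        topology = [n_in] + hidden[:k] + [n_out]
        w = train(topology, data, init=weights)
        # Keep the hidden layers, discard the output
        # layer before the next, deeper round.
        weights = w[:-1]
    return w  # weights after the final round
\end{verbatim}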
\section{Evaluation}\label{ch:Evaluation}
In order to evaluate the effect of different preprocessing algorithms, features
and adjustments in the \gls{MLP} training and topology, the following baseline
system was used:
Scale the recording to fit into a unit square while keeping the aspect ratio,
shift it into $[-1,1] \times [-1,1]$ as described in \cref{sec:preprocessing},
and resample it with linear interpolation to get 20~points per stroke, spaced
evenly in time. Take the first 4~strokes with 20~points per stroke and
2~coordinates per point as features, resulting in 160~features, which is equal
to the number of input neurons. If a recording has fewer than 4~strokes, the
remaining features are filled with zeros.
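
Serializing the preprocessed strokes into this input vector is
straightforward; the sketch below assumes every stroke has already been
resampled to 20~points:
\begin{verbatim}
def baseline_features(recording, max_strokes=4,
                      points_per_stroke=20):
    # Concatenate the (x, y) coordinates of the first
    # 4 strokes; zero-pad missing strokes. Result:
    # 4 * 20 * 2 = 160 input features.
    features = []
    for i in range(max_strokes):
        if i < len(recording):
            for (x, y, t) in recording[i]:
                features += [x, y]
        else:
            features += [0.0] * (2 * points_per_stroke)
    return features
\end{verbatim}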

All experiments were evaluated with four baseline systems $B_i$, $i \in \Set{1,
2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
could have a severe influence on the effect of new features or preprocessing
steps. Each hidden layer in all evaluated systems has $500$ neurons.
Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
of $\alpha = 0.1$. The activation function of every neuron in the hidden
layers is the sigmoid function; the neurons of the output layer use the
softmax activation function.
\section{Conclusion}
The aim of this bachelor's thesis was to build a recognition system that
can recognize many mathematical symbols with low error rates as well as to
evaluate which preprocessing steps and features help to improve the recognition
rate.

All recognition systems were trained and evaluated with
$\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
symbols. These recordings were collected by two crowdsourcing projects
(\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
\href{http://write-math.com}{write-math.com}) and created with various devices.
While some recordings were created with standard touch devices such as tablets
and smartphones, others were created with a mouse.

\Glspl{MLP} were used for the classification task. Four baseline systems with
different numbers of hidden layers were used, as the number of hidden layers
influences the capabilities and problems of \glspl{MLP}. Furthermore, an error
measure MER was defined, which takes the top three \glspl{hypothesis} of the
classifier, merges symbols such as \verb+\sum+ ($\sum$) and \verb+\Sigma+
($\Sigma$) into equivalence classes, and then calculates the error.

All baseline systems used the same preprocessing queue. The recordings were
scaled to fit into a unit square, shifted to $(0,0)$, and resampled with linear
interpolation so that every stroke had exactly 20~points which are spread
equidistantly in time. The 80~($x,y$) coordinates of the first 4~strokes were
used to get exactly $160$ input features for every recording. The baseline
system $B_2$ has a MER error of $\SI{5.67}{\percent}$.

Three variations of the scale-and-shift algorithm, wild point filtering, stroke
connect, weighted average smoothing, and Douglas-Peucker smoothing were
evaluated. The evaluation showed that the scale-and-shift algorithm is
extremely important and that the connect strokes algorithm improves the
classification. All other preprocessing algorithms either diminished the
classification performance or had less influence on it than the random
initialization of the \glspl{MLP} weights.

Adding two slightly rotated variants of each recording and hence tripling the
training set made the systems $B_3$ and $B_4$ perform much worse, but improved
the performance of the smaller systems.

The global features re-curvature, ink, stroke count and aspect ratio improved
the systems $B_1$--$B_3$, whereas the stroke center point feature made $B_2$
perform worse.

The learning rate and the momentum were evaluated. A learning rate of
$\eta=0.1$ and a momentum of $\alpha=0.9$ gave the best results. Newbob
training led to much worse recognition rates. Denoising auto-encoders were
evaluated as one way to use pretraining, but this increased the error rate
notably. However, supervised layer-wise pretraining improved the performance
decidedly.

The stroke connect algorithm was added to the preprocessing steps of the
baseline system, as were the re-curvature feature, the ink feature, the number
of strokes and the aspect ratio. The training setup of the baseline system was
changed to supervised layer-wise pretraining and the resulting model was
trained again with a lower learning rate. This optimized recognizer $B_{2,c}'$
had a MER error of $\SI{3.96}{\percent}$. This means that the MER error dropped
by over $\SI{30}{\percent}$ in comparison to the baseline system $B_2$.

A MER error of $\SI{3.96}{\percent}$ makes the system usable for symbol lookup.
It could also be used as a starting point for the development of a
multiple-symbol classifier.

The aim of this bachelor's thesis was to develop a symbol recognition system
which is easy to use, fast and has high recognition rates, as well as to
evaluate ideas for single-symbol classifiers. Some of those goals were reached.
The recognition system $B_{2,c}'$ evaluates new recordings in a fraction of a
second and has acceptable recognition rates. Many variations of algorithms were
evaluated. However, there are still many more algorithms which could be
evaluated and, at the time of this work, the best classifier $B_{2,c}'$ is not
publicly available.
\bibliographystyle{IEEEtranSA}
\bibliography{write-math-ba-paper}
\end{document}