
\documentclass[9pt,technote]{IEEEtran}
\usepackage{amssymb, amsmath} % needed for math
\usepackage{hyperref} % links in the text
\usepackage{parskip}
\usepackage[pdftex,final]{graphicx}
\usepackage{csquotes}
\usepackage{braket}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{pgfplots}
\usepackage{wasysym}
\usepackage[noadjust]{cite}
\usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
\usepackage[binary-units]{siunitx}
\sisetup{per-mode=fraction,binary-units=true}
\DeclareSIUnit\pixel{px}
\usepackage{glossaries}
\loadglsentries[main]{glossary}
\makeglossaries

\title{On-line Recognition of Handwritten Mathematical Symbols}
\author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}

\hypersetup{
 pdfauthor = {Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel},
 pdfkeywords = {Mathematics,Symbols,recognition},
 pdftitle = {On-line Recognition of Handwritten Mathematical Symbols}
}

\include{variables}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Begin document %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\begin{abstract}
The automatic recognition of single handwritten symbols has three main
applications. The first is to support users who know what a symbol looks like,
but not its name, such as $\saturn$. The second is to provide the commands
necessary for professional publishing in books or on websites, e.g.\ in the
form of \LaTeX{} commands, as MathML, or as code points. The third application
of single-symbol classifiers is as a building block for formula recognition.

This paper presents a system which uses the pen trajectory to classify
handwritten symbols. Five preprocessing steps, one data multiplication
algorithm, five features and five variants of multilayer perceptron training
were evaluated using $\num{166898}$ recordings which were collected with two
crowdsourcing projects. The evaluation results of these 21~experiments were
used to create an optimized recognizer which has a TOP-1 error of less than
$\SI{17.5}{\percent}$ and a TOP-3 error of $\SI{4.0}{\percent}$. This is a
relative improvement of $\SI{18.5}{\percent}$ for the TOP-1 error and
$\SI{29.7}{\percent}$ for the TOP-3 error compared to the baseline system.
\end{abstract}
\section{Introduction}
On-line recognition makes use of the pen trajectory. This means the data is
given as groups of sequences of tuples $(x, y, t) \in \mathbb{R}^3$, where each
group represents a stroke, $(x, y)$ is the position of the pen on a canvas and
$t$ is the time.

On-line data has been used to classify handwritten natural language text in
many different variants. For example, the NPen++ system classified cursive
handwriting into English words by using hidden Markov models and neural
networks~\cite{Manke1995}.

% One handwritten symbol in the described format is called a
% \textit{recording}. One approach to classify recordings into symbol classes
% assigns a probability to each class given the data. The classifier can be
% evaluated by using recordings which were classified by humans and were not used
% to train the classifier. The set of those recordings is called \textit{test
% set}. The TOP-$n$ error is defined as the fraction of the symbols where
% the correct class was not within the top $n$ classes of the highest
% probability.

Several systems for mathematical symbol recognition with on-line data have been
described so far~\cite{Kosmala98,Mouchere2013}, but no standard test set
existed to compare the results of different classifiers. The symbols used
differed between papers. This is unfortunate as the choice of symbols is
crucial for the TOP-$n$ error. For example, the symbols $o$, $O$, $\circ$ and
$0$ are very similar, and systems which know all of those classes will
certainly have a higher TOP-$n$ error than systems which only accept one of
them. Not only did the classes differ; the data used to train and test also had
to be collected anew by each author.

Daniel Kirsch describes in~\cite{Kirsch} a system called Detexify which uses
time warping to classify on-line handwritten symbols and reports a TOP-3 error
of less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols. He also
recently published his data on \url{https://github.com/kirel/detexify-data},
which was collected by a crowdsourcing approach via
\url{http://detexify.kirelabs.org}. Those recordings, as well as some
recordings which were collected by a similar approach via
\url{http://write-math.com}, were used to train and evaluate different
classifiers. A complete description of all involved software, data and
experiments is given in~\cite{Thoma:2014}.

In this paper we present a baseline system for the classification of on-line
handwriting into $369$ classes, some of which are very similar, as well as an
optimized classifier which achieves a $\SI{29.7}{\percent}$ relative
improvement of the TOP-3 error. This was achieved by using better features and
layer-wise supervised pretraining. The absolute improvements of those changes
compared to the baseline will also be shown.
\section{Steps in Handwriting Recognition}
The following steps are used for symbol classification:\nobreak
\begin{enumerate}
    \item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
          errors and people make mistakes while using the devices. To tackle
          these problems there are preprocessing algorithms to clean the data.
          The preprocessing algorithms can also remove unnecessary variations
          of the data that do not help in the classification process, but hide
          what is important. Having slightly different sizes of the same symbol
          is an example of such a variation. Four preprocessing algorithms that
          clean or normalize recordings are explained in
          \cref{sec:preprocessing}.
    \item \textbf{Data multiplication}: Learning algorithms need lots of data
          to learn internal parameters. If there is not enough data available,
          domain knowledge can be used to create new artificial data from
          the original data. In the domain of on-line handwriting recognition,
          data can be multiplied by adding rotated variants.
    \item \textbf{Segmentation}: The task of formula recognition can eventually
          be reduced to the task of symbol recognition combined with symbol
          placement. Before symbol recognition can be done, the formula has
          to be segmented. As this paper is only about single-symbol
          recognition, this step will not be discussed further.
    \item \textbf{Feature computation}: A feature is high-level information
          derived from the raw data after preprocessing. Some systems like
          Detexify take the result of the preprocessing step directly, but many
          compute new features. Those features can be designed by a human
          engineer or learned. Non-raw data features can have the advantage
          that less training data is needed, since the developer can use
          knowledge about handwriting to compute highly discriminative
          features. Various features are explained in \cref{sec:features}.
    \item \textbf{Feature enhancement}: Applying PCA, LDA, or
          feature standardization might change the features in ways that
          improve the performance of learning algorithms.
\end{enumerate}

After these steps, we are faced with a classification learning task which
consists of two parts:
\begin{enumerate}
    \item \textbf{Learning} parameters for a given classifier.
    \item \textbf{Classifying} new recordings, sometimes called
          \textit{evaluation}. This should not be confused with the evaluation
          of the classification performance which is done for multiple
          topologies, preprocessing queues, and features in
          \Cref{ch:Evaluation}.
\end{enumerate}

The classification learning task can be solved with \glspl{MLP} if the number
of input features is the same for every recording. There are many ways to
adjust \glspl{MLP} and their training. Some of them are described
in~\cref{sec:mlp-training}.
\section{Data and Implementation}
The combined data of Detexify and \href{http://write-math.com}{write-math.com}
can be downloaded via \href{http://write-math.com/data}{write-math.com/data} as
a compressed tar archive. It contains a list of $369$ symbols which are used in
a mathematical context. Each symbol has at least $50$ labeled examples, but
most symbols have more than $200$ labeled examples and some have more than
$2000$. In total, more than $\num{160000}$ labeled recordings were collected.

Preprocessing and feature computation algorithms were implemented and are
publicly available as open-source software in the Python package \texttt{hwrt};
\gls{MLP} algorithms are available in the Python package \texttt{nntoolkit}.
\section{Algorithms}
\subsection{Preprocessing}\label{sec:preprocessing}
Preprocessing in symbol recognition is done to improve the quality and
expressive power of the data. It should make follow-up tasks like segmentation
and feature extraction easier, more effective or faster. It does so by
resolving errors in the input data, reducing duplicate information and removing
irrelevant information.

Preprocessing algorithms fall into two groups: normalization and noise
reduction algorithms.

A very important normalization algorithm in single-symbol recognition is
\textit{scale-and-shift}~\cite{Thoma:2014}. It scales the recording so that
its bounding box fits into a unit square. As the aspect ratio of a recording
is almost never 1:1, only one dimension will fit the unit square exactly.
There are multiple ways to shift the recording. For this paper, the bigger
dimension was shifted to fit into the $[0,1] \times [0,1]$ unit square,
whereas the smaller dimension was centered in the $[-1,1] \times [-1,1]$
square.
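
The following minimal sketch illustrates the scaling part of this step. It
assumes a recording is given as a list of strokes, where each stroke is a list
of $(x, y, t)$ tuples; the exact shifting convention described above is
omitted.
\begin{verbatim}
def scale(strokes):
    """Scale a recording so that its bounding
    box fits into a unit square (sketch)."""
    xs = [p[0] for s in strokes for p in s]
    ys = [p[1] for s in strokes for p in s]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    factor = 1.0 / max(width, height, 1e-9)
    return [[((x - min(xs)) * factor,
              (y - min(ys)) * factor, t)
             for (x, y, t) in s] for s in strokes]
\end{verbatim}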
Another normalization preprocessing algorithm is resampling. As the data points
on the pen trajectory are generated asynchronously and with different time
resolutions depending on the hardware and software used, it is desirable to
resample the recordings to have points spread equally in time for every
recording. This was done by linear interpolation of the $(x,t)$ and $(y,t)$
sequences and taking a fixed number of equally spaced points per stroke.
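
A minimal sketch of this resampling step, assuming strokes with strictly
increasing time stamps, could use linear interpolation as follows:
\begin{verbatim}
import numpy as np

def resample(stroke, n=20):
    """Resample a stroke to n points which are
    equally spaced in time (sketch)."""
    x, y, t = (np.array(c) for c in zip(*stroke))
    t_new = np.linspace(t[0], t[-1], n)
    return list(zip(np.interp(t_new, t, x),
                    np.interp(t_new, t, y),
                    t_new))
\end{verbatim}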
\textit{Connect strokes} is a noise reduction algorithm. Sometimes the hardware
detects that the user lifted the pen although the user certainly did not do so.
This can be detected by measuring the Euclidean distance between the end of one
stroke and the beginning of the next stroke. If this distance is below a
threshold, the strokes are connected.
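
A minimal sketch of this idea, reusing the stroke representation from above,
could look as follows:
\begin{verbatim}
import math

def connect_strokes(strokes, theta=10.0):
    """Merge consecutive strokes whose gap is
    smaller than the threshold theta (sketch)."""
    merged = [strokes[0]]
    for stroke in strokes[1:]:
        x0, y0, _ = merged[-1][-1]  # end of last
        x1, y1, _ = stroke[0]       # start of next
        if math.hypot(x1 - x0, y1 - y0) < theta:
            merged[-1] = merged[-1] + stroke
        else:
            merged.append(stroke)
    return merged
\end{verbatim}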
Due to the limited resolution of the recording device and due to erratic
handwriting, the pen trajectory might not be smooth. One way to smooth it is
to replace each point by a weighted average of its own coordinates and the
coordinates of its neighbors. Another way to do smoothing is to reduce the
number of points to the most relevant ones with the Douglas-Peucker algorithm
and then interpolate the stroke between those points.

The Douglas-Peucker stroke simplification algorithm is usually used in
cartography to simplify the shape of roads. It works recursively to find a
subset of points of a stroke that is simpler and still similar to the original
shape. The algorithm adds the first and the last point $p_1$ and $p_n$ of a
stroke to the simplified set of points $S$. Then it searches the point $p_i$ in
between that has the maximum distance from the line $p_1 p_n$. If this
distance is above a threshold $\varepsilon$, the point $p_i$ is added to $S$.
Then the algorithm gets applied recursively to $p_1 p_i$ and $p_i p_n$. It is
described as \enquote{Algorithm 1} in~\cite{Visvalingam1990}.
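
A minimal recursive sketch of this algorithm, operating on a list of $(x, y)$
tuples, could look as follows:
\begin{verbatim}
import math

def douglas_peucker(points, epsilon):
    """Simplify a stroke recursively (sketch)."""
    def dist(p, a, b):
        # perpendicular distance of p from line a-b
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        norm = math.hypot(dx, dy)
        if norm == 0:
            return math.hypot(px - ax, py - ay)
        return abs(dx * (ay - py)
                   - dy * (ax - px)) / norm
    d_max, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = dist(points[i], points[0], points[-1])
        if d > d_max:
            d_max, index = d, i
    if d_max > epsilon:
        left = douglas_peucker(points[:index + 1],
                               epsilon)
        right = douglas_peucker(points[index:],
                                epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]
\end{verbatim}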
\subsection{Features}\label{sec:features}
Features can be \textit{global}, that is, calculated for the complete
recording or complete strokes. Other features are calculated for single points
on the pen trajectory and are called \textit{local}.

Global features are the \textit{number of strokes} in a recording, the
\textit{aspect ratio} of a recording's bounding box or the
\textit{ink} used for a recording. The ink feature is calculated by
measuring the length of all strokes combined. The re-curvature, which was
introduced in~\cite{Huang06}, is defined as
\[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
and is a stroke-global feature.

The simplest local feature is the coordinate of the point itself. Speed,
curvature and a local small-resolution bitmap around the point, which was
introduced by Manke, Finke and Waibel in~\cite{Manke1995}, are other local
features.
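
The following sketch shows how two of these global features, ink and
re-curvature, could be computed for the stroke representation used above; it is
an illustration, not the implementation used in \texttt{hwrt}.
\begin{verbatim}
import math

def ink(strokes):
    """Combined length of all strokes (sketch)."""
    total = 0.0
    for s in strokes:
        for (x0, y0, _), (x1, y1, _) in zip(s, s[1:]):
            total += math.hypot(x1 - x0, y1 - y0)
    return total

def re_curvature(stroke):
    """height(stroke) / length(stroke) (sketch)."""
    ys = [y for (_, y, _) in stroke]
    height = max(ys) - min(ys)
    return height / max(ink([stroke]), 1e-9)
\end{verbatim}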
\subsection{Multilayer Perceptrons}\label{sec:mlp-training}
\Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
different numbers of hidden layers, and the number of neurons per layer and the
activation functions can be varied. The learning algorithm is parameterized by
the learning rate $\eta \in (0, \infty)$, the momentum $\alpha \in [0, \infty)$
and the number of epochs.

The topology of \glspl{MLP} will be denoted in the following by separating the
number of neurons per layer with colons. For example, the notation
$160{:}500{:}500{:}500{:}369$ means that the input layer gets 160~features,
there are three hidden layers with 500~neurons per layer and one output layer
with 369~neurons.

\Glspl{MLP} can be trained in various ways, for example with \gls{SLP}. In the
case of a \gls{MLP} with the topology $160{:}500{:}500{:}500{:}369$, \gls{SLP}
works as follows: At first a \gls{MLP} with one hidden layer
($160{:}500{:}369$) is trained. Then the output layer is discarded, a new
hidden layer and a new output layer are added, and the network is trained
again, resulting in a $160{:}500{:}500{:}369$ \gls{MLP}. The output layer is
discarded again, a new hidden layer and a new output layer are added, and the
training is executed once more.
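
The following sketch outlines this layer-wise scheme with Keras; it is an
illustration of the procedure, not the \texttt{nntoolkit} implementation used
in the experiments, and the function name and epoch count are only
placeholders.
\begin{verbatim}
from tensorflow import keras

def train_slp(x, y, hidden=(500, 500, 500),
              classes=369, epochs=10):
    """Supervised layer-wise pretraining (sketch)."""
    sgd = lambda: keras.optimizers.SGD(0.1,
                                       momentum=0.1)
    model = keras.Sequential([
        keras.Input(shape=(x.shape[1],)),
        keras.layers.Dense(hidden[0],
                           activation="sigmoid"),
        keras.layers.Dense(classes,
                           activation="softmax"),
    ])
    model.compile(sgd(),
                  "sparse_categorical_crossentropy")
    model.fit(x, y, epochs=epochs)
    for width in hidden[1:]:
        model.pop()  # discard the output layer
        model.add(keras.layers.Dense(
            width, activation="sigmoid"))
        model.add(keras.layers.Dense(
            classes, activation="softmax"))
        model.compile(sgd(),
                      "sparse_categorical_crossentropy")
        model.fit(x, y, epochs=epochs)
    return model
\end{verbatim}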
Denoising auto-encoders are another way of pretraining. An
\textit{auto-encoder} is a neural network that is trained to restore its input.
This means the number of input neurons is equal to the number of output
neurons. The weights define an \textit{encoding} of the input that allows
restoring the input. As the neural network finds the encoding by itself, it is
called an auto-encoder. If the hidden layer is smaller than the input layer, it
can be used for dimensionality reduction~\cite{Hinton1989}. If only one hidden
layer with linear activation functions is used, then the hidden layer contains
the principal components after training~\cite{Duda2001}.

Denoising auto-encoders are a variant introduced in~\cite{Vincent2008} that is
more robust to partial corruption of the input features. They are made robust
by adding noise to the input features during training. There are multiple ways
in which noise can be added; Gaussian noise and randomly masking elements with
zero are two possibilities. \cite{Deeplearning-Denoising-AE} describes how such
a denoising auto-encoder with masking noise can be implemented. The
\texttt{corruption} is the probability of a feature being masked.
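
A minimal sketch of this masking noise, where \texttt{corruption} is the
probability of a feature being set to zero, could look as follows:
\begin{verbatim}
import numpy as np

def mask_noise(features, corruption=0.3,
               rng=np.random):
    """Randomly set features to zero (sketch)."""
    keep = rng.binomial(1, 1.0 - corruption,
                        size=features.shape)
    return features * keep
\end{verbatim}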
\section{Evaluation}\label{ch:Evaluation}
In order to evaluate the effect of different preprocessing algorithms, features
and adjustments in the \gls{MLP} training and topology, the following baseline
system was used: Scale the recording to fit into a unit square while keeping
the aspect ratio, shift it into $[-1,1] \times [-1,1]$ as described in
\cref{sec:preprocessing}, and resample it with linear interpolation to get
20~points per stroke, spaced evenly in time. Take the first 4~strokes with
20~points per stroke and 2~coordinates per point as features, resulting in
160~features, which is equal to the number of input neurons. If a recording has
fewer than 4~strokes, the remaining features are filled with zeroes.
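
A minimal sketch of building this fixed-length feature vector from the
preprocessed and resampled strokes could look as follows:
\begin{verbatim}
def baseline_features(strokes, n_strokes=4,
                      n_points=20):
    """First n_strokes x n_points (x, y) pairs,
    padded with zeroes (sketch)."""
    features = []
    for i in range(n_strokes):
        if i < len(strokes):
            for (x, y, _) in strokes[i][:n_points]:
                features += [x, y]
        else:
            features += [0.0, 0.0] * n_points
    return features  # 160 values by default
\end{verbatim}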
All experiments were evaluated with four baseline systems $B_i$, $i \in \Set{1,
2, 3, 4}$, where $i$ is the number of hidden layers, as different topologies
could have a severe influence on the effect of new features or preprocessing
steps. Each hidden layer in all evaluated systems has $500$ neurons.

Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
the sigmoid function $\text{sig}(x) := \frac{1}{1+e^{-x}}$. The neurons in the
output layer use the softmax function. For every experiment, exactly one part
of the baseline systems was changed.
\subsection{Random Weight Initialization}
The neural networks in all experiments were initialized with small random
weights
\[w_{i,j} \sim U\left(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}},\; 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}\right)\]
where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
initialization was suggested in~\cite{deeplearningweights} and is done to break
symmetry. It might lead to different error rates for the same system just
because the initialization was different.
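
A minimal sketch of this initialization for the weight matrix between two
adjacent layers could look as follows:
\begin{verbatim}
import numpy as np

def init_weights(n_l, n_next, rng=np.random):
    """Uniform initialization in the interval
    given above (sketch)."""
    limit = 4.0 * np.sqrt(6.0 / (n_l + n_next))
    return rng.uniform(-limit, limit,
                       size=(n_l, n_next))
\end{verbatim}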
In order to get an impression of the magnitude of this influence on the
different topologies and error rates, the baseline models were trained 5~times
with different random initializations.
\Cref{table:baseline-systems-random-initializations-summary}
shows a summary of the results. The more hidden layers are used, the more the
results vary between different random weight initializations.

\begin{table}[h]
    \centering
    \begin{tabular}{crrr|rrr} %chktex 44
    \toprule
    \multirow{3}{*}{System} & \multicolumn{6}{c}{Classification error}\\
    \cmidrule(l){2-7}
     & \multicolumn{3}{c}{TOP-1} & \multicolumn{3}{c}{TOP-3}\\
     & min & max & range & min & max & range\\\midrule
    $B_1$ & $\SI{23.08}{\percent}$ & $\SI{23.44}{\percent}$ & $\SI{0.36}{\percent}$ & $\SI{6.67}{\percent}$ & $\SI{6.80}{\percent}$ & $\SI{0.13}{\percent}$ \\
    $B_2$ & \underline{$\SI{21.45}{\percent}$} & \underline{$\SI{21.83}{\percent}$} & $\SI{0.38}{\percent}$ & $\SI{5.68}{\percent}$ & \underline{$\SI{5.75}{\percent}$} & $\SI{0.07}{\percent}$\\
    $B_3$ & $\SI{21.54}{\percent}$ & $\SI{22.28}{\percent}$ & $\SI{0.74}{\percent}$ & \underline{$\SI{5.50}{\percent}$} & $\SI{5.82}{\percent}$ & $\SI{0.32}{\percent}$\\
    $B_4$ & $\SI{23.19}{\percent}$ & $\SI{24.84}{\percent}$ & $\SI{1.65}{\percent}$ & $\SI{5.98}{\percent}$ & $\SI{6.44}{\percent}$ & $\SI{0.46}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{The systems $B_1$--$B_4$ were randomly initialized, trained
             and evaluated 5~times to estimate the influence of random weight
             initialization.}
    \label{table:baseline-systems-random-initializations-summary}
\end{table}
\subsection{Connect Strokes}
In order to solve the problem of interrupted strokes, pairs of strokes can be
connected with the stroke connect algorithm. The idea is that for a pair of
consecutively drawn strokes $s_{i}, s_{i+1}$ the last point of $s_i$ is close
to the first point of $s_{i+1}$ if a stroke was accidentally split into two
strokes.

$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connect
algorithm was tried with thresholds below this range: $\SI{5}{\pixel}$,
$\SI{10}{\pixel}$ and $\SI{20}{\pixel}$.

The TOP-3 error of all models improved with a threshold of $\theta =
\SI{10}{\pixel}$ by at least $\SI{0.17}{\percent}$, except for $B_4$, which
improved only by $\SI{0.01}{\percent}$; this could be a result of random weight
initialization.
\subsection{Douglas-Peucker Smoothing}
The Douglas-Peucker algorithm can be used to find points that are more relevant
for the overall shape of a recording. After that, an interpolation can be done.
If the interpolation is a cubic spline interpolation, this makes the recording
smooth.

The Douglas-Peucker algorithm was applied with thresholds of $\varepsilon =
0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
but before resampling. The interpolation in the resampling step was done
linearly in one set of experiments and with cubic splines in another. The
recording was scaled and shifted again after the interpolation because the
bounding box might have changed.

Applying Douglas-Peucker smoothing with $\varepsilon > 0.05$ led to a sharp
increase of the TOP-1 and TOP-3 error for all models $B_i$. This means that
the simplification process removes some relevant information and does not, as
was expected, remove only noise. For $\varepsilon = 0.05$ with linear
interpolation the TOP-1 error of some models improved, but the changes were
small; they could be an effect of random weight initialization. However, cubic
spline interpolation made all systems perform more than $\SI{1.7}{\percent}$
worse for the TOP-1 and TOP-3 error.

The lower the value of $\varepsilon$, the less the recording changes in this
preprocessing step. As it was applied after scaling the recording such that the
biggest dimension of the recording (width or height) is $1$, a value of
$\varepsilon = 0.05$ means that a point has to deviate by at least
$\SI{5}{\percent}$ of the biggest dimension from the simplified stroke in order
to be kept.
\subsection{Global Features}
Single global features were added one at a time to the baseline systems. Those
features were the re-curvature $\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
as described in~\cite{Huang06}, the ink feature, which is the summed length of
all strokes, the stroke count, the aspect ratio and the stroke center points of
the first four strokes. The stroke center point feature improved the system
$B_1$ by $\SI{0.27}{\percent}$ for the TOP-3 error and the system $B_3$ for the
TOP-1 error by $\SI{0.74}{\percent}$, but all other systems and error measures
either got worse or did not improve much.

The other global features improved the systems $B_1$--$B_3$, but not $B_4$.
The highest improvement was achieved with the re-curvature feature. It improved
the systems $B_1$--$B_4$ by more than $\SI{0.6}{\percent}$ in the TOP-1 error.
\subsection{Data Multiplication}
Data multiplication can be used to make the model invariant to transformations.
However, this idea does not seem to work well in the domain of on-line
handwritten mathematical symbols. The data was tripled by adding two rotated
variants of each recording, one rotated by 3~degrees to the left and one by
3~degrees to the right around its center of mass. This data multiplication made
all classifiers perform worse for most error measures, with the TOP-1 error
increasing by more than $\SI{2}{\percent}$.
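
A minimal sketch of the rotation used for this data multiplication, again for
the stroke representation used above, could look as follows:
\begin{verbatim}
import math

def rotated(strokes, degrees):
    """Rotate a recording around its center of
    mass (sketch)."""
    pts = [p for s in strokes for p in s]
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[(cx + (x - cx) * c - (y - cy) * s,
              cy + (x - cx) * s + (y - cy) * c, t)
             for (x, y, t) in st] for st in strokes]
\end{verbatim}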
\subsection{Pretraining}\label{subsec:pretraining-evaluation}
Pretraining is a technique used to improve the training of \glspl{MLP} with
multiple hidden layers.

\Cref{fig:training-and-test-error-for-different-topologies-pretraining} shows
the evolution of the TOP-1 error over 1000~epochs with supervised layer-wise
pretraining and without pretraining. It clearly shows that this kind of
pretraining improves the classification performance by $\SI{1.6}{\percent}$
for the TOP-1 error and $\SI{1.0}{\percent}$ for the TOP-3 error.

\begin{figure}[htb]
    \centering
    \input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
    \caption{Training and test error by number of trained epochs for different
             topologies with \gls{SLP}. The plot shows that all pretrained
             systems performed much better than the systems without
             pretraining. None of the plotted systems improved with more
             epochs of training.}
    \label{fig:training-and-test-error-for-different-topologies-pretraining}
\end{figure}

Pretraining with denoising auto-encoders led to the much worse results listed
in \cref{table:pretraining-denoising-auto-encoder}. The first layer used a
$\tanh$ activation function. Every layer was trained for $1000$ epochs with the
\gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
$0.3$ and an $L_2$ regularization of $\lambda = 10^{-4}$ were chosen. This
pretraining setup made all systems perform much worse on all error measures.

\begin{table}[tb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
     & TOP-1 & change & TOP-3 & change \\\midrule
    $B_{1,p}$ & $\SI{23.75}{\percent}$ & $\SI{+0.41}{\percent}$ & $\SI{7.19}{\percent}$ & $\SI{+0.39}{\percent}$\\
    $B_{2,p}$ & \underline{$\SI{22.76}{\percent}$} & $\SI{+1.25}{\percent}$ & $\SI{6.38}{\percent}$ & $\SI{+0.63}{\percent}$\\
    $B_{3,p}$ & $\SI{23.10}{\percent}$ & $\SI{+1.17}{\percent}$ & \underline{$\SI{6.14}{\percent}$} & $\SI{+0.40}{\percent}$\\
    $B_{4,p}$ & $\SI{25.59}{\percent}$ & $\SI{+1.71}{\percent}$ & $\SI{6.99}{\percent}$ & $\SI{+0.87}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Systems with denoising auto-encoder pretraining compared to pure
             gradient descent. The pretrained systems clearly performed worse.}
    \label{table:pretraining-denoising-auto-encoder}
\end{table}
\subsection{Optimized Recognizer}
All preprocessing steps and features that proved useful were combined to create
a recognizer that should perform best.

All resulting models were much better than everything that was tried before.
The results of this experiment show that single-symbol recognition with
\totalClassesAnalyzed{} classes and usual touch devices and the mouse can be
done with a TOP-1 error rate of $\SI{18.56}{\percent}$ and a TOP-3 error of
$\SI{4.11}{\percent}$. This was achieved by a \gls{MLP} with a
$167{:}500{:}500{:}\totalClassesAnalyzed{}$ topology. It used an algorithm to
connect strokes whose ends were less than $\SI{10}{\pixel}$ apart, scaled each
recording to a unit square and shifted this unit square to $(0,0)$. After that,
a linear resampling step was applied to the first 4~strokes to resample them to
20~points each. All other strokes were discarded.

The 167~features were
\begin{itemize}
    \item the first 4~strokes with 20~points per stroke, resulting in
          160~features,
    \item the re-curvature of the first 4~strokes,
    \item the ink,
    \item the number of strokes and
    \item the aspect ratio.
\end{itemize}

\Gls{SLP} was applied with $\num{1000}$ epochs per layer, a learning rate of
$\eta=0.1$ and a momentum of $\alpha=0.1$. After that, the complete model was
trained again for $1000$ epochs with standard mini-batch gradient descent.

After the models $B_{1,c}$--$B_{4,c}$ were trained for the first $1000$ epochs,
they were trained for another $1000$ epochs with a learning rate of
$\eta = 0.05$. \Cref{table:complex-recognizer-systems-evaluation} shows that
this improved the classifiers again.

\begin{table}[htb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
     & TOP-1 & change & TOP-3 & change\\\midrule
    $B_{1,c}$ & $\SI{20.96}{\percent}$ & $\SI{-2.38}{\percent}$ & $\SI{5.24}{\percent}$ & $\SI{-1.56}{\percent}$\\
    $B_{2,c}$ & $\SI{18.26}{\percent}$ & $\SI{-3.25}{\percent}$ & $\SI{4.07}{\percent}$ & $\SI{-1.68}{\percent}$\\
    $B_{3,c}$ & \underline{$\SI{18.19}{\percent}$} & $\SI{-3.74}{\percent}$ & \underline{$\SI{4.06}{\percent}$} & $\SI{-1.68}{\percent}$\\
    $B_{4,c}$ & $\SI{18.57}{\percent}$ & $\SI{-5.31}{\percent}$ & $\SI{4.25}{\percent}$ & $\SI{-1.87}{\percent}$\\\midrule
    $B_{1,c}'$ & $\SI{19.33}{\percent}$ & $\SI{-1.63}{\percent}$ & $\SI{4.78}{\percent}$ & $\SI{-0.46}{\percent}$ \\
    $B_{2,c}'$ & \underline{$\SI{17.52}{\percent}$} & $\SI{-0.74}{\percent}$ & \underline{$\SI{4.04}{\percent}$} & $\SI{-0.03}{\percent}$\\
    $B_{3,c}'$ & $\SI{17.65}{\percent}$ & $\SI{-0.54}{\percent}$ & $\SI{4.07}{\percent}$ & $\SI{+0.01}{\percent}$\\
    $B_{4,c}'$ & $\SI{17.82}{\percent}$ & $\SI{-0.75}{\percent}$ & $\SI{4.26}{\percent}$ & $\SI{+0.01}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Error rates of the optimized recognizer systems. The systems
             $B_{i,c}'$ were trained for another $1000$ epochs with a learning
             rate of $\eta=0.05$. The value of the column \enquote{change} of
             the systems $B_{i,c}'$ is relative to $B_{i,c}$.}
    \label{table:complex-recognizer-systems-evaluation}
\end{table}
\section{Discussion}
Four baseline recognition systems were adjusted in many experiments and their
recognition capabilities were compared in order to build a recognition system
that can recognize 369 mathematical symbols with low error rates as well as to
evaluate which preprocessing steps and features help to improve the recognition
rate.

All recognition systems were trained and evaluated with
$\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
symbols. These recordings were collected by two crowdsourcing projects
(\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
\href{http://write-math.com}{write-math.com}) and created with various devices.
While some recordings were created with standard touch devices such as tablets
and smartphones, others were created with the mouse.

\Glspl{MLP} were used for the classification task. Four baseline systems with
different numbers of hidden layers were used, as the number of hidden layers
influences the capabilities and problems of \glspl{MLP}.

All baseline systems used the same preprocessing queue. The recordings were
scaled to fit into a unit square, shifted to $(0,0)$, and resampled with linear
interpolation so that every stroke had exactly 20~points which are spread
equidistantly in time. The 80~$(x,y)$ coordinates of the first 4~strokes were
used to get exactly $160$ input features for every recording. The baseline
system $B_2$ has a TOP-3 error of $\SI{5.75}{\percent}$.

Adding two slightly rotated variants of each recording and hence tripling the
training set made the systems $B_3$ and $B_4$ perform much worse, but improved
the performance of the smaller systems.

The global features re-curvature, ink, stroke count and aspect ratio improved
the systems $B_1$--$B_3$, whereas the stroke center point feature made $B_2$
perform worse.

Denoising auto-encoders were evaluated as one way to use pretraining, but this
increased the error rate notably. However, supervised layer-wise pretraining
improved the performance decidedly.

The stroke connect algorithm was added to the preprocessing steps of the
baseline system, as well as the re-curvature feature, the ink feature, the
number of strokes and the aspect ratio. The training setup of the baseline
system was changed to supervised layer-wise pretraining and the resulting model
was trained again with a lower learning rate. This optimized recognizer
$B_{2,c}'$ had a TOP-3 error of $\SI{4.04}{\percent}$. This means that the
TOP-3 error dropped by over $\SI{1.7}{\percent}$ in comparison to the baseline
system $B_2$.

A TOP-3 error of $\SI{4.04}{\percent}$ makes the system usable for symbol
lookup. It could also be used as a starting point for the development of a
multiple-symbol classifier.

The aim of this work was to develop a symbol recognition system which is easy
to use, fast and has high recognition rates, as well as to evaluate ideas for
single-symbol classifiers. Some of those goals were reached. The recognition
system $B_{2,c}'$ evaluates new recordings in a fraction of a second and has
acceptable recognition rates.

% Many algorithms were evaluated.
% However, there are still many other algorithms which could be evaluated and, at
% the time of this work, the best classifier $B_{2,c}'$ is only available
% through the Python package \texttt{hwrt}. It is planned to add a web version
% of that classifier online.
\bibliographystyle{IEEEtranSA}
\bibliography{write-math-ba-paper}
\end{document}