9th April 2021

The automatic system of handwritten text recognition (HTR) for Polish lexicographic sources

Jan Idziak, Artjoms Šeļa, Michał Woźniak, Albert Leśniak, Joanna Byszuk, Maciej Eder (Instytut Języka Polskiego PAN)

The paper discusses an approach to deciphering large collections of handwritten index cards of historical dictionaries. Our study aims at reading the cards and linking their lemmas to a searchable list of dictionary entries for a large historical dictionary, the Dictionary of the 17th- and 18th-century Polish, which comprises 2.8 million index cards. We apply a tailored handwritten text recognition (HTR) solution that involves (1) an optimized detection model based on keras-ocr-craft; (2) a recognition model for deciphering the handwritten content, designed as an STN transformation followed by an RCNN with a ResNet backbone and a CTC layer, and trained on the CVIT dataset and a synthetic set of 500,000 generated Polish words of different lengths; (3) a post-processing step in which the CTC (connectionist temporal classification) predictions were decoded using a Constrained Word Beam Search, that is, matched against a list of dictionary entries known in advance. Our model achieved a word-level accuracy of 0.881, which can be considered competitive with the base model offered by an RCNN network. Within this study we produced a set of 20,000 manually annotated index cards that can be used for future benchmarks and for transfer learning in HTR applications.
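To illustrate the idea behind step (3), the sketch below shows a deliberately simplified form of lexicon-constrained decoding. It is not the Constrained Word Beam Search used in the study: it takes the best-path (greedy) CTC output and snaps it to the closest entry in a known word list by edit distance. The alphabet, the blank index, the sample frames and the three-word lexicon are all assumptions made up for this example.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_ids: Sequence[int], alphabet: str) -> str:
    """Best-path CTC rule: collapse repeated labels, then drop blanks."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            chars.append(alphabet[idx - 1])  # label k maps to alphabet[k - 1]
        prev = idx
    return "".join(chars)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(
                prev_row[j] + 1,               # deletion
                row[j - 1] + 1,                # insertion
                prev_row[j - 1] + (ca != cb),  # substitution or match
            ))
        prev_row = row
    return prev_row[-1]


def snap_to_lexicon(raw: str, lexicon: List[str]) -> str:
    """Return the lexicon entry closest to the raw prediction."""
    return min(lexicon, key=lambda word: edit_distance(raw, word))


# Hypothetical toy data: per-frame argmax labels from a recognizer.
alphabet = "abcehlłopw"
frames = [3, 3, 0, 5, 6, 6, 0, 4, 9]      # decodes to the misreading "chlep"
lexicon = ["chleb", "chłop", "chwała"]    # stand-in for the list of dictionary entries

raw = ctc_greedy_decode(frames, alphabet)
lemma = snap_to_lexicon(raw, lexicon)
print(raw, lemma)
```

A beam search constrained by the word list works on the full per-frame probability distribution rather than on a single best path, so it can recover words that this two-stage shortcut would miss. The sketch only shows why knowing the set of dictionary entries in advance helps.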



The meeting will take place live on Zoom at 1 pm. To participate, please fill in the survey: https://forms.gle/4K1MJ7V9JW8MDKmq7. The link to the meeting will be sent to the email address provided in the form.

The first part of the meeting (the lecture) will be recorded and later uploaded to our YouTube channel. Although we will only be recording the slides and the speaker's audio, we kindly ask those of you who do not want to risk accidentally sharing your image to turn off your cameras during the lecture and turn them back on for the second part of the meeting, the discussion, which will not be recorded.