The Third Eye: Artificial Intelligence and Reading Manuscripts

Chicken Scratch

In our genealogy team, we used to joke that, long ago, to get a job at the Civil Registry Office you had to be able to sign documents with your left foot and prove that you had dysgraphia. (We deny the rumor: it was not necessary.)

Still, everyone who has begun an adventure with collections of old manuscripts, from literary scholars and historians to genealogy enthusiasts, has torn their hair out over them. How much disappointment, frustration and doubt can an encounter with handwritten records from a century or three ago cause? Ask anyone who has discovered in an archival reading room that knowing a foreign language, or even a foreign alphabet, may not be enough to read documents written long ago. Students and graduates of history and related faculties, who have had the chance to wrestle with handwriting in palaeography and neography classes, are usually less shocked, but these “close encounters of the third kind” always stay in our memory.

Of course, “practice makes perfect”: the more time we spend with (for example) genealogical documents or in manuscript reading rooms, the less trouble we have decoding a message that a stranger’s hand put on paper years ago. Civil status records have a specific, repeatable structure, and with each page of a less predictable text (a diary, a letter, a literary work) our eyes get used to someone else’s handwriting and learn to recognize the possible variants of each letter. After reading a few textbooks on palaeography or neography, we will easily recognize from its script the period in which a given note was written, and we will learn how the shapes of letters (and the ways of writing sounds) changed in Latin, Polish, Russian (Ruthenian), German and Yiddish texts, and in any others we come across during our archival searches.

Today it is hard to work with manuscripts without the benefits of technology. Thanks to digitization, anyone can view documents held in archives and libraries around the world from the comfort of their own home. Even when we visit those archives or libraries in person, we often simply take dozens or even hundreds of photographs so that we can analyze the manuscripts outside the walls and opening hours of the reading room. Without this, visits to institutions that hold historical sources (which, given the current epidemiological restrictions, we rather miss) could, depending on the scale and design of the research, last several months instead of weeks, or several weeks instead of days.

But let’s go a step further. And no, we are not recommending meditation to genealogists enraged by a sheet of paper full of wild scrawls. The third eye was “grown” for us by someone else.

Shortcut: HTR

Today, artificial intelligence comes to the aid of genealogists and all other readers of historical texts. Why should you consider making friends with AI? The answer is simple: because it can be useful, especially in larger projects such as transcribing and translating historical texts or indexing record books. The computer can read the manuscript for us.

First, a little glossary:

HTR (Handwritten Text Recognition), also known as HWR (Handwriting Recognition), is the automatic conversion of handwritten text into machine-readable characters. HTR systems prepare the image, segment it and, finally, identify the characters recorded on it.

Segmentation is nothing more than marking out the path that the writer’s hand followed as the tip of a pen or pencil left its mark on the paper. It is not very different from ballistic analysis, except that what is reconstructed is the trajectory of ink on a sheet of paper rather than the flight of a projectile.

ICR (Intelligent Character Recognition) is a handwriting recognition system that allows the computer to learn fonts and different styles of handwriting. It is an advanced form of OCR (Optical Character Recognition). ICR-based programs learn by themselves: they rely on so-called artificial neural networks, i.e. mathematical structures that carry out calculations and process signals, and that “train” our tools to acquire, in this case, a new script or typeface.

In two sentences: the graphic material, i.e. a scan of our manuscript, is digitally processed, which means it is denoised, normalized and segmented. An intelligent program then performs geometric analysis and calculations, and matches the results against dictionaries and linguistic knowledge.
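
To make the first of those two sentences more concrete, here is a minimal Python sketch of the preprocessing stage on a toy image. It is an illustration under loud assumptions: the function names (binarize, despeckle, segment_lines), the fixed threshold of 128 and the simple row-by-row way of finding text lines are choices made for this example only, not a description of how any particular program works internally.

```python
def binarize(gray, threshold=128):
    """Normalize a grayscale image (rows of 0-255 ints) to ink=1 / paper=0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Crude denoising: remove ink pixels that have no ink neighbour."""
    h, w = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                # Count ink in the 3x3 window, minus the pixel itself.
                neighbours = sum(
                    binary[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                ) - 1
                if neighbours == 0:
                    cleaned[y][x] = 0
    return cleaned


def segment_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # the text line ends
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


W, K = 255, 0  # white paper, black ink (hypothetical toy values)
page = [
    [W, W, W, W, W, W],
    [W, K, K, K, W, W],   # first line of "writing"
    [W, K, W, K, W, W],
    [W, W, W, W, W, W],
    [W, W, W, W, W, K],   # an isolated speck of noise
    [W, W, W, W, W, W],
    [K, K, K, W, W, W],   # second line of "writing"
    [W, W, W, W, W, W],
]
lines = segment_lines(despeckle(binarize(page)))  # -> [(1, 3), (6, 7)]
```

Without the despeckle step the lone speck would be reported as a third line. Real manuscripts add skewed, touching and overlapping lines, which this sketch deliberately ignores.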

Example Programs

Transcript https://www.jacobboerema.nl/en/Freeware.htm

Transcript is an uncomplicated program for transcribing manuscripts. It lets you view the photos and type the text in the same window, so there is no need to juggle two programs, two windows or two monitors. From the editor, you can move the visible part of the image in many ways using keyboard shortcuts, while working on the transcription or translation in the other part of the screen. The program is free for personal use.

FromThePage https://beta.fromthepage.com/ 

FromThePage is an open source tool that allows volunteers to collaborate on transcribing handwritten documents. With it, many contributors can join forces to transcribe a single text: the manuscript of a literary work, a collection of civil status records or a diary.

eLaborate https://elaborate.huygens.knaw.nl/ 

eLaborate is a platform where you can upload scans, transcribe them, annotate the text and publish the results.

OCR4all http://www.ocr4all.org/

OCR4all is software designed for optical character recognition of historical prints, including those whose complicated typefaces and uneven layout defeat most other OCR programs. OCR4all combines various tools in one consistent interface so that segmentation, recognition and transcription can all be done in one place.

Transkribus https://readcoop.eu/transkribus/ 

Transkribus is probably the most interesting platform: it offers the most possibilities and can transcribe both printed and handwritten documents. It includes a range of automated processing tools, such as OCR, handwritten text recognition and layout analysis. All Transkribus services are available via the web interface and are provided free of charge. Transkribus can be trained to recognize documents in the languages we are interested in, for example Arabic, English, Old German, Polish, Bengali, Hebrew or Dutch. Each user gets a package of 500 free transcription pages.

In practice…

Not all of these programs have a clear and intuitive interface. Fortunately, each project’s website offers manuals and webinar recordings that answer most questions, with instructors walking through every stage of working with a manuscript in the transcription program.

After uploading photos or scans of the documents, we have to deal with segmenting the text. First we mark the areas where the text is located and separate the main body from the margins, then we check whether the program has correctly extracted the individual lines.

Then we move on to transcription. Each line in the scan is assigned a numbered line in the text editor pane. We must follow the old text faithfully, copying it character by character as it appears in the original. Since there were no uniform spelling rules in the past, correct spelling and grammar are of secondary importance here. Words should be separated or joined as in the original text, even where this departs from current practice.
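
The pairing of scan lines with numbered editor lines can be pictured as a small data structure. The following Python sketch is hypothetical: pair_lines and missing_lines are names invented for this example, the (top, bottom) boxes stand in for whatever line regions a real program stores, and the sample line of text is made up.

```python
def pair_lines(line_boxes, transcriptions):
    """Pair each segmented line with its transcription, numbering from 1.

    line_boxes: list of (top, bottom) pixel rows, in reading order.
    transcriptions: dict mapping line number -> text typed by the transcriber.
    """
    paired = []
    for number, box in enumerate(line_boxes, start=1):
        # The text is stored exactly as typed: no spell-checking, no case
        # changes, no merging or splitting of words.
        paired.append({
            "number": number,
            "box": box,
            "text": transcriptions.get(number, ""),
        })
    return paired


def missing_lines(paired):
    """Line numbers that still have no transcription."""
    return [p["number"] for p in paired if not p["text"].strip()]


boxes = [(1, 3), (6, 7)]                          # two lines found on the scan
typed = {1: "Anno Domini 1786 die vndecima"}      # made-up sample, old spelling kept
todo = missing_lines(pair_lines(boxes, typed))    # line 2 is still untranscribed
```

The point of the sketch is that the transcription is tied to the line on the image, not to modern orthography: "vndecima" stays "vndecima".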

To teach the program to recognize a text, we have to transcribe the first pages ourselves. For a handwritten text, between five and fifteen thousand words must be transcribed before the program can analyze the rest on its own, so that the automatic transcription of the remaining text needs only corrections from us.
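
As a rough back-of-the-envelope aid, the five to fifteen thousand words can be converted into a number of pages. The Python sketch below assumes about 250 words per manuscript page, a purely hypothetical value that should be replaced with a count from your own source; the function name pages_needed is likewise invented for this example.

```python
import math


def pages_needed(target_words, words_per_page):
    """Pages to transcribe by hand to reach target_words of training text."""
    if words_per_page <= 0:
        raise ValueError("words_per_page must be positive")
    return math.ceil(target_words / words_per_page)


WORDS_PER_PAGE = 250  # hypothetical; count a few pages of your own manuscript

low = pages_needed(5_000, WORDS_PER_PAGE)     # lower bound from the text
high = pages_needed(15_000, WORDS_PER_PAGE)   # upper bound from the text
print(f"Transcribe roughly {low} to {high} pages by hand.")
```

With densely written pages the number drops, and with short register entries it rises, so it is worth counting the words on a few typical pages before planning the work.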

The program learns with us, so once the work is finished we can submit the analyzed texts to the database. The more users and styles of writing there are in the database, the easier it becomes to work with subsequent documents.