
Cover: Mecca, Bâb ʻAlī, gate of the Holy Mosque at the eastern corner; the Zemzem house is visible through the middle portal. (Library of Congress, LC-DIG-pmsca-38168)



Please use the latest version at https://alraqmiyyat.github.io/


By: Maxim Romanov, Matthew Thomas Miller, Sarah Bowen Savant, and Benjamin Kiessling



The OpenITI team, building on the foundational open-source OCR work of Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities, has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words). These accuracy rates not only represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for classical Arabic-script texts; equally important, they are produced using open-source OCR software called Kraken (developed by Benjamin Kiessling, LU), which will enable us to make this Arabic-script OCR technology freely available to the broader Islamic, Persian, and Arabic Studies communities in the near future. Unlike more traditional OCR approaches, Kraken relies on a neural network, which mimics the way we learn, to recognize letters in the images of entire lines of text without first trying to segment lines into words and then words into letters. This segmentation step, a mainstream OCR approach that persistently fails on connected scripts, is thus removed from the process entirely, making Kraken uniquely powerful for dealing with the wide variety of ligatures in connected Arabic script. In the process we also generated over 7,000 lines of “gold standard” (double-checked) data that others can use for Arabic-script OCR training and testing.
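For readers who want to run comparable tests against gold-standard lines, here is a minimal sketch of one common way to compute a character-level accuracy rate: the total edit distance between each OCR line and its gold-standard counterpart, divided by the number of gold characters. This is an assumption for illustration only. The post does not specify the exact formula behind the reported rates, and the function names `edit_distance` and `character_accuracy` are hypothetical, not part of Kraken.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(gold_lines: Sequence[str], ocr_lines: Sequence[str]) -> float:
    """Return 1 - (total character errors / total gold characters).

    Assumed metric, not necessarily the one used for the reported results.
    Characters are counted as Unicode code points.
    """
    if len(gold_lines) != len(ocr_lines):
        raise ValueError("gold and OCR output must have the same number of lines")
    total_chars = sum(len(g) for g in gold_lines)
    if total_chars == 0:
        raise ValueError("gold standard contains no characters")
    errors = sum(edit_distance(g, o) for g, o in zip(gold_lines, ocr_lines))
    return 1.0 - errors / total_chars


if __name__ == "__main__":
    gold = ["كتاب", "قال"]   # double-checked transcription, line by line
    ocr = ["كتات", "قال"]    # hypothetical OCR output with one wrong letter
    print(f"{character_accuracy(gold, ocr):.3f}")
```

Because the comparison is made line by line, it matches how the gold-standard data is organized: each line image is paired with one transcribed line, and no word or letter segmentation is needed to score the result.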

Our working paper can be found on Academia.edu.



Kraken ibn Ocropus. Based on a depiction of an octopus from a manuscript of Kitāb al-ḥašāʾiš fī hāyūlā al-ʿilāj al-ṭibbī (Leiden, UB : Or. 289); special thanks to Emily Selove for help with finding an octopus in the depths of the Islamic MS tradition.




Posted by Maxim Romanov

Research fellow (PhD in Near Eastern Studies, U of Michigan, 2013) at the Humboldt Chair for Digital Humanities [Institut für Informatik], University of Leipzig. He studies Islamic historical texts with computational methods, currently focusing on the analysis of multivolume biographical and bibliographical collections.
