Arabographic Optical Character Recognition (OCR)

The OpenITI team, building on the foundational open-source OCR work of Leipzig University's (LU) Alexander von Humboldt Chair for Digital Humanities, has achieved Optical Character Recognition (OCR) accuracy rates in the high nineties for classical Arabic-script texts. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words). These accuracy rates represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for classical Arabic-script texts. Equally important, they are produced with Kraken, an open-source OCR program developed by Benjamin Kiessling (LU), which will enable us to make this Arabic-script OCR technology freely available to the broader Islamic, Persian, and Arabic Studies communities in the near future.

Unlike more traditional OCR approaches, Kraken relies on a neural network (a model that learns from examples, loosely mimicking the way we learn) to recognize letters in images of entire lines of text, without first trying to segment lines into words and then words into letters. This segmentation step, which is standard in mainstream OCR and persistently fails on connected scripts, is thus removed from the process entirely, making Kraken uniquely powerful for dealing with the diverse variety of ligatures in connected Arabic script.

In the process we also generated over 7,000 lines of “gold standard” (double-checked) data that others can use for Arabic-script OCR training and testing.
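For readers who want to test OCR output against gold-standard lines, here is a minimal sketch of one common way to score it: character accuracy, computed as one minus the total edit distance divided by the total number of gold characters. This post does not spell out how the accuracy rates above were calculated, so treat the metric and the function names `levenshtein` and `character_accuracy` as illustrative assumptions, not as the method used in the tests.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(gold_lines, ocr_lines) -> float:
    """Assumed metric: 1 - (total edit distance / total gold characters).

    Characters are counted as Unicode code points, so each vocalization
    mark counts as a character of its own.
    """
    if len(gold_lines) != len(ocr_lines):
        raise ValueError("gold and OCR output must have the same number of lines")
    total_chars = sum(len(line) for line in gold_lines)
    if total_chars == 0:
        raise ValueError("gold standard contains no characters")
    errors = sum(levenshtein(g, o) for g, o in zip(gold_lines, ocr_lines))
    return 1.0 - errors / total_chars


if __name__ == "__main__":
    gold = ["كتاب", "العلم"]
    ocr = ["كتات", "العلم"]  # one wrong letter in the first line
    print(f"character accuracy: {character_accuracy(gold, ocr):.4f}")
```

Because the score is computed over whole lines, it fits line-based recognition: no word or letter segmentation of the OCR output is needed to compare it with the gold standard.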

Our working paper, by Benjamin Kiessling, Matthew Thomas Miller, Maxim Romanov, and Sarah Bowen Savant, can be found on Academia.edu.

Kraken ibn Ocropus. Based on a depiction of an octopus from a manuscript of Kitāb al-ḥašāʾiš fī hāyūlā al-ʿilāj al-ṭibbī (Leiden, UB : Or. 289); special thanks to Emily Selove for help with finding an octopus in the depths of the Islamic MS tradition.

Maxim Romanov
