Steps Towards Mining Manuscript Images for Untranscribed Texts: A Case Study From the Syriac Collection at the Vatican Library

Abstract

Digital libraries and databases of texts are invaluable resources for researchers, yet their reliance on printed editions can leave significant gaps, potentially excluding works that lack printed editions. The Simtho database of Syriac is a pertinent example: it is derived primarily from OCR of scholarly editions, but how representative are these of the language's extensive literary tradition, transmitted and preserved in manuscript form for centuries? Taking the Simtho database and a selection of the Vatican Library's Syriac manuscript collection as a case study, we propose a pipeline that aligns a corpus of e-texts with a set of digitised manuscript images, in order to ascertain which texts are present in one corpus but absent from the other and thus contribute to the enrichment of both. We examine the complexities of this task, evaluating both effective tools for alignment and approaches to detecting factors that can contribute to alignment failures. This case study is intended as a first step towards foundational methodologies applicable to larger-scale manuscript processing efforts.