Bringing Rome to Life: Evaluating Historical Image Generation
Abstract
This study evaluates the potential of AI image generation for visualising historical events, focusing on two ancient Roman scenarios: the Roman triumph and the Lupercalia festival. Using DALL-E 3, we generated 600 images from 100 prompts derived from scientific texts. We then conducted a two-part evaluation: (1) a human evaluation by 21 history students, who compared image pairs and rated individual images on accuracy and prompt alignment, and (2) two automated analyses, one modelled on the human evaluation protocol and one using visual question-answering (VQA) techniques. Our results reveal both the promise and the limitations of AI in historical visualisation. While DALL-E 3 produced many convincing images, there were notable discrepancies between human and automated assessments. In particular, we found that Large Language Models tended to rate images more favourably than human evaluators did. We contribute a novel dataset for historical image generation, initial human and automated evaluation protocols, and insights into the challenges of using AI for historical visualisation, which can support historians in reconstructing past events. Our findings highlight the need for refined evaluation methods and underscore the complexity of assessing historical accuracy in AI-generated imagery. This study lays the groundwork for future research on improving AI models for historical visualisation and developing more robust evaluation frameworks.