Who Advertises in Newspapers? Data Criticism in Mining Historical Job Ads
Abstract
Digitized newspapers are a unique and rich source of historical data, but results obtained by mining them pose significant challenges of interpretation. The JobAds project (FWF P35783) explores the evolution of the labor market through job advertisements in digitized newspapers published between 1850 and 1950, aiming to reveal regional and temporal trends in job offers, required skills, media strategies, and social aspects such as gender-specific ads. From the ANNO corpus, we selected the 29 newspapers with the most editions. Processing them involved preselection of pages containing job ads, layout segmentation, optical character recognition (OCR), and post-correction. Each of these steps introduces potential biases because its performance varies. Additionally, the inherent bias of newspapers as historical sources must be considered, as they reflect only a subset of the job market dynamics of their time. This paper identifies these biases, quantifies their impact, and proposes solutions for each step, from corpus selection to data preparation for subsequent text mining and analysis. We discuss and exemplify the implications of these biases for research outcomes and suggest methodological adjustments to mitigate their effects, with the aim of obtaining more reliable insights into the historical labor market. In addition, we release a dataset of 15 000 manually annotated ground-truth instances as part of this paper.