A geolocated dataset of German news articles

Authors

  • Lukas Kriesch
  • Sebastian Losacker

Keywords

News data, Natural language processing, Geography

Abstract

The emergence of large language models and the exponential growth of digitized text data have revolutionized research methodologies across a broad range of social sciences. News articles are an important source of such data: they provide real-time insights into public discourse and societal trends and thereby help researchers understand social phenomena and their dynamics. However, most research involving news data is conducted at the national level, because geographically more granular news data is often unavailable. In this paper, we address this gap by showing how news articles can be geolocated and how the texts can then be analyzed further. Specifically, we collect data from the CommonCrawl News dataset and clean the text. We then use a named-entity recognition model for geocoding, linking news articles to geographic locations. Finally, we transform the news articles into text embeddings using SBERT, which enables semantic search within the corpus. We apply this process to all German news articles and make the German location data and the embeddings available for download. The resulting dataset contains text embeddings for about 50 million German news articles, of which about 70% include geographic locations. Because we provide all code and workflows, the process can be replicated for news data from other countries.
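
The two core steps described in the abstract, linking articles to places and searching the corpus by meaning, can be illustrated with a minimal, dependency-free sketch. This is not the authors' pipeline: a small hypothetical gazetteer lookup stands in for the named-entity recognition model, and hand-written three-dimensional vectors stand in for SBERT embeddings. All names below (`GAZETTEER`, `CORPUS`, `geolocate`, `search`) and the example sentences are assumptions made for illustration.

```python
import math
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude).
# Stand-in for a named-entity recognition model followed by a geocoder.
GAZETTEER = {
    "Berlin": (52.52, 13.405),
    "Hamburg": (53.551, 9.994),
    "München": (48.137, 11.575),
}

# Toy article embeddings: (article_id, vector).
# Stand-in for SBERT embeddings, which have far more dimensions.
CORPUS = [
    ("a1", [0.9, 0.1, 0.0]),
    ("a2", [0.1, 0.9, 0.2]),
]


def geolocate(text):
    """Return gazetteer places mentioned in text, in order of first mention."""
    found = []
    for token in re.findall(r"\w+", text):
        if token in GAZETTEER and token not in found:
            found.append(token)
    return [(name, GAZETTEER[name]) for name in found]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, corpus, top_k=2):
    """Return the ids of the top_k articles most similar to query_vec."""
    scored = [(cosine(query_vec, vec), article_id) for article_id, vec in corpus]
    scored.sort(reverse=True)
    return [article_id for _, article_id in scored[:top_k]]


if __name__ == "__main__":
    print(geolocate("Berlin fördert Start-ups; auch München zieht nach."))
    print(search([1.0, 0.0, 0.0], CORPUS, top_k=1))
```

In a real workflow, the lookup would be replaced by an actual NER model and the toy vectors by embeddings of the article text; the ranking by cosine similarity would stay conceptually the same.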

Published

2025-02-26

Section

Working papers