A geolocated dataset of German news articles
Keywords:
News data, Natural language processing, Geography
Abstract
The emergence of large language models and the rapid growth of digitized text data have transformed research methodologies across a broad range of social sciences. News articles are an important source of such data: they provide real-time insights into public discourse and societal trends, helping researchers understand social phenomena and dynamics. However, most research involving news data is conducted at the national level, because geographically more granular news data is often unavailable. In this paper, we address this gap by showing how news articles can be geolocated and how their texts can then be analyzed further. Specifically, we collect data from the CommonCrawl News dataset and clean the text for analysis. We then use a named-entity recognition model for geocoding, linking news articles to geographic locations. Finally, we transform the news articles into text embeddings using SBERT, enabling semantic search within the news corpus. We apply this process to all German news articles and make the German location data, as well as the embeddings, available for download. The resulting dataset contains text embeddings for about 50 million German news articles, of which about 70% include geographic locations. Because we provide all code and workflows, the process can be replicated for news data from other countries.
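To illustrate what a semantic search over geolocated embeddings could look like, the following is a minimal sketch and not the code released with the paper. It assumes each article record carries a precomputed embedding and a list of location names. The field names (`id`, `locations`, `embedding`), the function names, and the three-dimensional toy vectors are hypothetical; real SBERT embeddings have several hundred dimensions and would come from the published dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, articles, location=None, top_k=3):
    """Rank articles by similarity to the query, optionally within one location.

    `articles` is a list of dicts with hypothetical keys
    'id', 'locations' and 'embedding'.
    """
    candidates = [
        a for a in articles
        if location is None or location in a["locations"]
    ]
    scored = [
        (cosine_similarity(query_vec, a["embedding"]), a["id"])
        for a in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy records standing in for geolocated articles (assumed structure).
articles = [
    {"id": "a1", "locations": ["Berlin"], "embedding": [0.9, 0.1, 0.0]},
    {"id": "a2", "locations": ["München"], "embedding": [0.8, 0.2, 0.1]},
    {"id": "a3", "locations": [], "embedding": [0.0, 0.1, 0.9]},
    {"id": "a4", "locations": ["Berlin", "Hamburg"], "embedding": [0.1, 0.9, 0.2]},
]

# In practice the query vector would be produced by the same embedding
# model as the articles; here it is a hand-written toy vector.
query = [1.0, 0.0, 0.0]

berlin_hits = semantic_search(query, articles, location="Berlin")
all_hits = semantic_search(query, articles, top_k=2)
```

Filtering by location before scoring keeps the similarity computation limited to the relevant subset. For a corpus of tens of millions of articles, a vectorized or approximate nearest-neighbor index would likely replace the linear scan shown here.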