From cipher to plain text?

Topic modelling Swedish governmental reports, 1945–1989


  • Pelle Snickars



digital humanities, digital history, topic modelling, media history, Swedish Governmental Official Reports (SOU)


In 2015 the National Library of Sweden finished digitising all Governmental Official Reports (SOU) from 1922 to 1999. Traditionally, SOU reports – and the work performed within different governmental committees – were tasked with preparing the Swedish government for apt and rational decision-making. The subjects covered by governmental committees and SOU reports span virtually every area of the Swedish welfare state, from migration and the environment to cultural and media policy.

The article takes as its starting point an analysis of all SOU reports from 1945–89 as one massive dataset: 3,154 reports in all, containing 87 million tokens. Research has been performed within a Jupyter Lab environment, a web application in which Python code can be executed to perform data analysis. The Jupyter Lab environment has been developed at the digital humanities hub Humlab at Umeå University, and the research is part of the project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media & Culture, 1945–89. This is a digital humanities and digital history project that will digitise literature, curate already digitised collections, and perform research via probabilistic methods and text mining models.

If all SOU reports were to be considered one single text written by the state, which themes in this vast text is software able to read and perceive? It is possible to answer such a broad question by way of topic modelling, a computational method for studying themes in texts by accentuating words that tend to co-occur. Via co-occurrence, topic modelling produces topics in the form of clusters of related words; a term or a word may be part of several topics with different degrees of probability. Topics also occur in relation to each other, and the resulting clusters and networks can be visualised using software such as Gephi.
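As an illustration of the principle, the following is a minimal sketch of a topic model fitted by collapsed Gibbs sampling for latent Dirichlet allocation, written in plain Python. It is not the pipeline used in the project; the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the six-document toy corpus are all assumptions introduced here for illustration. The sketch shows how topics emerge from word co-occurrence alone, and how every word receives a probability in every topic.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper, small corpora only).

    docs is a list of token lists. Returns one dict per topic mapping
    each vocabulary word to its smoothed probability within that topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (document, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Resample each token's topic, conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed word distribution for each topic.
    return [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + n_words * beta)
         for w in vocab}
        for k in range(n_topics)
    ]

# Invented toy corpus (not SOU data): three "media" and three "housing" documents.
docs = [
    "film censorship cinema film board".split(),
    "press subsidy newspaper press daily".split(),
    "film press radio broadcast cinema".split(),
    "housing rent tenant housing loan".split(),
    "loan rent building housing tenant".split(),
    "building tenant rent loan housing".split(),
]
phi = fit_lda(docs, n_topics=2)

# List the five most probable words of each topic.
for k, dist in enumerate(phi):
    top = sorted(dist, key=dist.get, reverse=True)[:5]
    print(k, top)
```

Because of the smoothing parameter `beta`, no word has zero probability in any topic; a word such as "film" simply carries more weight in some topics than in others, which is the sense in which a term can belong to several topics at once.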

The article focuses on topics related to media and media policy. Depending on how many topics a topic model contains – in the article, models of 50, 100, 200, and 500 topics are used – different media topics can be detected. In the 50-topic model, one media topic was found, whereas in the 500-topic model there were several, with more specific traits such as film censorship or daily press subsidies. One finding is that film was the single medium to which the SOU genre devoted the most attention between 1945 and 1989. Another finding is that archival issues were closely linked to media topics during the same period. Governmental committees and SOU reports on media were primarily focused on future-oriented policies, above all how media should be supported or regulated. Yet archiving the same media forms was also something in which the state was repeatedly interested.
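One simple way to locate media topics in models of different sizes is to check each topic's most probable words against a list of seed terms. The sketch below assumes that each model is available as a mapping from topic id to its top words; the helper `media_topics`, the seed list, and the two miniature models are hypothetical and invented for illustration, and they do not reproduce the article's models or results.

```python
def media_topics(model, seed_terms, min_hits=2):
    """Return ids of topics whose top words include at least min_hits seed terms."""
    seeds = set(seed_terms)
    return [
        topic_id
        for topic_id, top_words in model.items()
        if len(seeds & set(top_words)) >= min_hits
    ]

# Invented seed vocabulary for "media".
seeds = ["film", "radio", "press", "newspaper", "television", "cinema"]

# Invented miniature models: a coarse one with 2 topics, a finer one with 4.
coarse = {
    0: ["film", "radio", "press", "culture"],
    1: ["tax", "income", "housing", "pension"],
}
fine = {
    0: ["film", "censorship", "cinema", "board"],
    1: ["press", "subsidy", "newspaper", "daily"],
    2: ["tax", "income", "housing", "pension"],
    3: ["archive", "film", "radio", "recording"],
}

print(media_topics(coarse, seeds))  # [0]
print(media_topics(fine, seeds))    # [0, 1, 3]
```

In this toy setting the coarse model yields a single broad media topic, while the finer model splits it into more specific ones. A keyword filter of this kind only proposes candidates; each candidate topic still has to be inspected and interpreted by the researcher.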

In conclusion, the article explains what topic modelling is in general, how the method can be used in digital historical research – not least in relation to close reading – and how statistical analysis of the distribution of words in the form of topics can generate interesting results. The SOU data is rich, and topics covering many different themes can be traced in it. As a researcher, however, one must learn to work with data: to load different models into the Jupyter Lab environment, to compute various input values, to change parameters, and often to curate outcomes in ways that differ from traditional historical research practices.

Digital illustrations: