Bridging the gap between digital humanities and natural language processing: Text cleaning and NLP accuracy evaluation of a sample of 20th century Romanian novels

Bogdan Vătavu; David Morariu

doi:10.35824/sjrs.v9i2.28654

Authors

Bogdan Vătavu Lucian Blaga University of Sibiu, Romania https://orcid.org/0000-0001-7997-5625
David Morariu Lucian Blaga University of Sibiu, Romania https://orcid.org/0000-0001-5941-2200

Keywords:

computational linguistics, part of speech tagging, lemmatization, natural language processing, digital humanities, Romanian language corpus, Romanian literary texts

Abstract

Although Digital Humanities (DH) and Natural Language Processing (NLP) share common ground, the two academic communities around these disciplines rarely benefit from each other. This study aims to bridge the gap between the two fields by establishing whether off-the-shelf NLP tools are suitable for analysing DH-oriented textual corpora. Our case study focuses on The Digital Museum of the Romanian Novel (DMRN), a digital archive which includes most of the novels in Romanian literature up until 1947. Currently, the archive is composed of more than 1200 digital files in PDF format, derived from the original print editions. An optical character recognition (OCR) layer was added during digitization, which opens the possibility of applying NLP techniques to these texts. However, OCRed texts are notoriously difficult to process, being laden with spelling errors and a significant amount of noise. Our case study had two main objectives: (1) to devise an automated method for cleaning the large quantities of OCRed literary texts found in the DMRN archive, and (2) to analyse the cleaned texts with two of the currently available off-the-shelf NLP toolkits, namely spaCy and Stanza. The sample of our study consists of 15 novels published between 1933 and 1947. The extracted texts were cleaned with a custom Python script and further sampled for NLP tasks such as tokenization, part-of-speech tagging and lemmatization. We then assessed the accuracy of the results of each task through manual validation. Overall, our methodology proved efficient and time-saving with regard to the automatic cleaning process, and the accuracy percentages obtained for the NLP tasks were satisfactory.
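The two steps summarised in the abstract, automatic cleaning of OCRed text and accuracy scoring against manual validation, can be illustrated with a minimal Python sketch. This is not the authors' script: the function names `clean_ocr_text` and `accuracy`, the specific cleaning rules (Unicode NFC normalization, mapping cedilla letters to the comma-below forms, rejoining words split across line breaks, dropping lines that hold only a page number) and the scoring formula are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumption: OCR output may contain the legacy cedilla letters
# (U+015F, U+015E, U+0163, U+0162); map them to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",
})


def clean_ocr_text(raw: str) -> str:
    """Apply a few generic, illustrative cleaning rules to OCRed text."""
    # Compose base letters and combining marks into single code points.
    text = unicodedata.normalize("NFC", raw)
    text = text.translate(CEDILLA_TO_COMMA)
    # Rejoin words hyphenated across a line break. This is a simplification:
    # it would also merge legitimate Romanian hyphens that end a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but digits (likely page numbers).
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def accuracy(predicted: list[str], validated: list[str]) -> float:
    """Percentage of automatic labels that match the manually validated ones."""
    if len(predicted) != len(validated):
        raise ValueError("both sequences must cover the same tokens")
    if not predicted:
        return 0.0
    correct = sum(p == v for p, v in zip(predicted, validated))
    return 100.0 * correct / len(predicted)
```

In such a setup, `predicted` would hold the tags or lemmas returned by a toolkit for a sampled passage and `validated` the labels confirmed by a human annotator for the same tokens, so the same function could serve for part-of-speech tagging and lemmatization alike.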

Author Biographies

Bogdan Vătavu, Lucian Blaga University of Sibiu, Romania

https://orcid.org/0000-0001-7997-5625

https://www.webofscience.com/wos/author/record/POV-1988-2026

https://www.scopus.com/authid/detail.uri?authorId=58029299000

Bogdan Vătavu, a Lecturer/Assistant Professor at Lucian Blaga University in Sibiu, teaches various disciplines in the field of Library and Information Science. He earned his PhD in History from Babeș-Bolyai University Cluj-Napoca, Romania, in 2019. His academic interests encompass social history, library and information science, and the digital humanities. His research focuses on topics such as the history of crime in Romania and its representations in popular culture, digital collections and text mining, and the state of Romanian librarianship. Additionally, he has experience as a librarian and has worked for several years at the Octavian Goga Cluj County Library.

David Morariu, Lucian Blaga University of Sibiu, Romania

https://orcid.org/0000-0001-5941-2200

https://www.webofscience.com/wos/author/record/IAO-6861-2023

https://www.scopus.com/authid/detail.uri?authorId=57200625552

David Morariu is a PhD candidate at Lucian Blaga University of Sibiu and teaching assistant with the Department of Romance Studies, Faculty of Letters and Arts. His research interests include literary discourse analysis and the linguistic strategies of othering in the Romanian novel of the nineteenth and twentieth centuries. Drawing on frameworks such as corpus-assisted discourse studies, the discourse-historical approach and corpus-based literary onomastics, his doctoral research examines linguistic stereotyping and name-based discrimination of Roma characters in the Romanian novel published between 1845 and 1947. He was part of the digitization team for the research project The Digital Museum of the Romanian Novel (1845-1947) and has also been involved in other research projects focusing on emerging paradigms in teaching Romanian as a foreign language. He is currently a research assistant in the project CORECON, which examines the coverage of the Russian-Ukrainian conflict in Polish-, Romanian- and English-language media.
