Bridging the gap between digital humanities and natural language processing: Text cleaning and NLP accuracy evaluation of a sample of 20th century Romanian novels
DOI: https://doi.org/10.35824/sjrs.v9i2.28654
Keywords: computational linguistics, part of speech tagging, lemmatization, natural language processing, digital humanities, Romanian language corpus, Romanian literary texts
Abstract
Although Digital Humanities (DH) and Natural Language Processing (NLP) share common ground, the two academic communities rarely benefit from each other. This study aims to bridge the gap between the two fields by establishing whether off-the-shelf NLP tools are suitable for analysing DH-oriented textual corpora. Our case study focuses on The Digital Museum of the Romanian Novel (DMRN), a digital archive that includes most of the novels in Romanian literature up until 1947. The archive currently comprises more than 1,200 PDF files derived from the original print editions. An optical character recognition (OCR) layer was added during digitisation, which makes it possible to apply NLP techniques to these texts. OCRed texts, however, are notoriously difficult to process, being laden with spelling errors and a significant amount of noise. Our case study had two main objectives: (1) to devise an automated method for cleaning the large quantities of OCRed literary text found in the DMRN archive, and (2) to analyse the cleaned texts with two currently available off-the-shelf NLP toolkits, spaCy and Stanza. The sample consists of 15 novels published between 1933 and 1947. The extracted texts were cleaned with a custom Python script and then sampled for three NLP tasks: tokenization, part-of-speech tagging and lemmatization. We assessed the accuracy of the results of each task separately through manual validation. Overall, our methodology proved efficient and time-saving for the automatic cleaning process, and the accuracy achieved on the NLP tasks was satisfactory.
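The abstract does not describe the internals of the custom cleaning script or of the validation procedure, so the following is only a minimal sketch of what such steps might look like, not the authors' method. It assumes three cleaning operations that are common for OCRed Romanian text (Unicode NFC normalization, mapping legacy cedilla letters to their comma-below forms, and rejoining words hyphenated across line breaks) and a simple per-token accuracy measure against manually validated labels. The function names `clean_ocr_text` and `accuracy` are hypothetical.

```python
import re
import unicodedata

# Assumption: OCR output may contain the legacy cedilla letters instead of
# the comma-below letters used in standard Romanian orthography.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def clean_ocr_text(raw: str) -> str:
    """Apply a few generic cleaning steps to OCR-extracted text (illustrative)."""
    # Compose base letters and combining diacritics into single code points.
    text = unicodedata.normalize("NFC", raw)
    text = text.translate(CEDILLA_TO_COMMA)
    # Rejoin words split by a hyphen at a line break. This is naive: it would
    # also merge genuine hyphenated forms that happen to end a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces but keep blank lines (paragraphs).
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of tokens whose predicted label matches the validated label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

In a workflow like the one described, the labels passed to `accuracy` would be the tags or lemmas produced by spaCy or Stanza on one side and the manually validated ones on the other. A real pipeline would likely need dictionary checks before rejoining hyphenated words, since Romanian uses the hyphen in many ordinary forms.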
Copyright (c) 2026 Bogdan Vătavu, David Morariu

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.