Text similarity computation in large collections of Finnic oral folk poetry

Maciej Janicki (University of Helsinki)

Maciej Janicki is a postdoctoral researcher in Digital Humanities at the University of Helsinki. He obtained his PhD in Computer Science from the University of Leipzig in 2019 for a thesis on unsupervised learning of morphology. His main current interests are unsupervised and statistical methods for processing non-standard linguistic data.

In this talk I show how computational methods for text similarity detection can be applied to a large corpus of Finnic oral folk poetry, comprising over 280,000 texts from combined Finnish and Estonian archival collections. The corpus was compiled by merging several existing large collections within the Academy of Finland project “Formulaic intertextuality, thematic networks and poetic variation across regional cultures of Finnic oral folk poetry” (FILTER), a consortium that includes researchers from the Finnish Literature Society, the University of Helsinki and the Estonian Literary Museum.

Computational text similarity detection, at levels ranging from individual lines to entire texts, produces results that can be used in three ways. First, they enable large-scale quantitative views of the collections. Second, they aid qualitative research by providing links to similar and related texts, helping the researcher find all potentially relevant material for a focused study. Third, they provide additional criteria by which the collections can be queried, allowing researchers to explore the lesser-known parts of the collections. I will illustrate these use cases with examples from our ongoing project.

To participate, please sign up here: https://forms.gle/4K1MJ7V9JW8MDKmq7