5th July 2024

(meeting in Polish)

Krzysztof Nowak, Rafał Górski, Michał Woźniak, Dorota Mika, Wojciech Guz and Wojciech Łukasik (Institute of Polish Language PAS)

Presentation of the Speech Corpus created as part of the Dariah.lab project – digital research infrastructure for the humanities and art sciences

The Speech Corpus is an infrastructure for creating and archiving conversational data. It was developed as part of the “Digital research infrastructure for the humanities and art sciences” project, conducted in 2020-2023 by the DARIAH-PL scientific consortium.

The project collected over 1,000 hours of recordings from the websites acast.com, newonce.net, soundcloud.com, spreaker.com and youtube.com. The recordings were used to develop the Dariah.lab corpus infrastructure and have been made available as annotated corpora with a sound layer. The collection documents the use of spoken Polish in the years 2011-2020 and later.

The data is available in the form of a corpus at https://korpusmowy.ijp.pan.pl/. The corpora published on the platform consist of two layers – sound (recordings) and text (transcriptions of the recordings). The recordings can be explored using the SpoCo search engine. In our talk, we will present ways of searching the corpus and using it for linguistic research.

To participate online, please sign up here: https://forms.gle/4K1MJ7V9JW8MDKmq7