Paragraphs and excerpts, or: 1830-1918 micro-corpus

Magdalena Derwojedowa

In my talk, I will present a one-million-word corpus of Polish from 1830-1918, consisting of small samples of texts from the period. It was built for the project “Automatic morphological analysis of Polish texts from 1830-1918 period with respect to evolution of inflection and spelling” (DEC-2012/07/B/HS2/00570).

I will start by presenting the microstructure of the corpus (sampling, metadata and source files) and will briefly discuss the problems we encountered while working on the samples. In the second part, I will present the macrostructure of the corpus: its division into subcorpora and the variation achieved across the samples. Finally, I will present selected studies of linguistic phenomena that can be carried out with the corpus.

The corpus, together with the Poliqarp search engine, is available online at Search in dictionaries (https://szukajwslownikach.uw.edu.pl/f19/).