Analyzing Translated Czech in a Monolingual Comparable Corpus: Possibilities and Limitations

Lucie Chlumská (Univerzita Karlova)

This talk presents a large quantitative study of translated Czech, focused on the so-called T-universals (such as simplification or convergence), and discusses the limitations of monolingual comparable corpus design that could influence such research. Possible implications for data analysis and interpretation are illustrated using the Jerome corpus of translated and non-translated Czech. The discussion covers three topics specifically: text size in comparable corpora and its impact on statistical testing; text-type and genre diversity in lexical analysis (based on n-gram and POS-gram extraction); and the possible consequences of a single source language predominating.