LVK2022

The Balanced Corpus of Modern Latvian

The Balanced Corpus of Modern Latvian (LVK2022) contains unique texts not included in the previously developed balanced corpora (LVK2013 and LVK2018). The corpus is primarily based on the design principles of the previous balanced corpora. It contains authentic contemporary texts (mostly created after 2000) of various genres, together with metadata. Unlike its predecessors, this balanced corpus contains both texts in the original language and translations. When texts were selected from the web, all current pages from a single domain were collected first, and the content suitable for the corpus was extracted. In the next processing step, the text was divided into paragraphs, and duplicates and paragraphs irrelevant to the corpus (texts in foreign languages, tables, etc.) were deleted. Paragraphs in some fiction documents have been rearranged alphabetically to comply with contractual obligations to publishing companies. The balanced corpus is composed of the processed documents in the following genre proportions: journalism (60%), fiction (10%), scientific (10%), Wikipedia (7%), legal (7%), parliamentary transcripts (3%) and subtitles (3%).
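The paragraph-level processing and the genre proportions described above can be illustrated with a minimal Python sketch. It is not the developers' actual pipeline: the function names, the blank-line paragraph convention, the exact-match duplicate test and the `is_relevant` filter are all assumptions made for illustration, and the description does not state the unit (words, tokens or documents) in which the genre shares are measured.

```python
import re

# Genre shares in percent, as listed in the corpus description.
GENRE_SHARES = {
    "journalism": 60,
    "fiction": 10,
    "scientific": 10,
    "Wikipedia": 7,
    "legal": 7,
    "parliamentary transcripts": 3,
    "subtitles": 3,
}


def split_paragraphs(text):
    """Split text on blank lines (an assumed convention) and normalise whitespace."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def clean_paragraphs(paragraphs, is_relevant=lambda p: True, sort_alphabetically=False):
    """Drop exact duplicates and irrelevant paragraphs; optionally sort the rest.

    `is_relevant` is a hypothetical hook for rejecting foreign-language text,
    tables and similar material. Sorting uses Python's default code-point
    order, which is not Latvian collation.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        if paragraph in seen or not is_relevant(paragraph):
            continue
        seen.add(paragraph)
        kept.append(paragraph)
    if sort_alphabetically:
        kept.sort()
    return kept


def genre_targets(total_units):
    """Split a total size across genres according to the stated shares."""
    return {genre: total_units * share // 100 for genre, share in GENRE_SHARES.items()}


if __name__ == "__main__":
    sample = "b paragraph\n\na paragraph\n\nb paragraph\n\n| table | row |"
    paragraphs = split_paragraphs(sample)
    # Treat lines starting with "|" as table residue (illustrative rule only).
    print(clean_paragraphs(paragraphs, lambda p: not p.startswith("|"), True))
    print(genre_targets(1000))
```

Sorting paragraphs within a document keeps every sentence available for word-level statistics and searches while preventing the original running text from being read in order, which is presumably why it satisfies the publishers' conditions.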

K. Levāne-Petrova, R. Darģis, K. Pokratniece, V. J. Lasmanis
The Balanced Corpus of Modern Latvian (LVK2022)
CLARIN-LV digital library, 2022
Corpus size: 101M words (123M tokens)
Development period: 2019–2022
Developers: Institute of Mathematics and Computer Science, University of Latvia
Funding: Latvian Language Agency