LVK2022
The Balanced Corpus of Modern Latvian
The Balanced Corpus of Modern Latvian (LVK2022) contains unique texts that are not included in the balanced corpora developed earlier (LVK2013 and LVK2018). Its design is primarily based on the principles of those earlier corpora. It contains authentic contemporary texts (mostly created after 2000) of various genres, together with metadata. Unlike its predecessors, this corpus contains both original texts and translations. When texts were selected from the web, all current pages from one domain were collected first, and the content suitable for the corpus was extracted. The text was then divided into paragraphs, and duplicates and paragraphs irrelevant to the corpus (texts in foreign languages, tables, etc.) were deleted. Paragraphs in some fiction documents have been rearranged alphabetically to comply with contractual obligations to publishing companies. The corpus is composed of the processed documents in the following genre proportions: journalism (60%), fiction (10%), scientific (10%), Wikipedia (7%), legal (7%), parliamentary transcripts (3%) and subtitles (3%).
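The paragraph-level processing described above can be illustrated with a minimal Python sketch. It is not the developers' actual pipeline: the function names, the relevance heuristic (rejecting table-like rows and paragraphs made up mostly of non-letters) and the normalization used to detect duplicates are all assumptions made for illustration. Foreign-language filtering is left out because it would require a language identifier.

```python
import re


def is_relevant(paragraph: str) -> bool:
    """Hypothetical relevance check (an assumption, not the corpus rule)."""
    # Treat rows containing tabs or pipes as table residue.
    if "\t" in paragraph or "|" in paragraph:
        return False
    # Reject paragraphs in which fewer than half the characters are letters.
    letters = sum(ch.isalpha() for ch in paragraph)
    return letters / len(paragraph) >= 0.5


def process_document(text: str, seen: set, alphabetical: bool = False) -> list:
    """Split text into paragraphs, drop duplicates and irrelevant paragraphs.

    `seen` holds normalized paragraphs already kept, so it can be shared
    across documents to remove duplicates corpus-wide.
    """
    kept = []
    # Paragraphs are assumed to be separated by blank lines.
    for raw in re.split(r"\n\s*\n", text):
        paragraph = raw.strip()
        if not paragraph or not is_relevant(paragraph):
            continue
        # Normalize whitespace and case before comparing for duplicates.
        key = " ".join(paragraph.split()).casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(paragraph)
    if alphabetical:
        # Plain code-point ordering; proper Latvian collation would need
        # a locale-aware sort key.
        kept.sort(key=str.casefold)
    return kept


sample = "Zebra teksts.\n\nAlfa teksts.\n\nZebra  teksts.\n\na\t1\t2\n\n"
in_order = process_document(sample, set())
rearranged = process_document(sample, set(), alphabetical=True)
```

Passing `alphabetical=True` mimics the rearrangement applied to some fiction documents: the paragraphs are kept, but their original order is not.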
The Balanced Corpus of Modern Latvian (LVK2022)
CLARIN-LV digital library, 2022
http://hdl.handle.net/20.500.12574/84
Corpus size | 101M words (123M tokens) |
Data period | 2000–2021 |
Development period | 2019–2022 |
Developers | Institute of Mathematics and Computer Science UL |
Funding | Latvian Language Agency |
CLARIN | http://hdl.handle.net/20.500.12574/84 |
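As a rough illustration of how the genre proportions in the description relate to the corpus size listed above, the following Python sketch derives approximate per-genre word counts. It assumes that the percentages are measured in words and that the rounded figure of 101M words is exact. Neither assumption is confirmed by the source, so the results are estimates only.

```python
# Rounded corpus size from the table above (assumed exact for this estimate).
TOTAL_WORDS = 101_000_000

# Genre shares in percent, as listed in the corpus description.
GENRE_SHARES = {
    "journalism": 60,
    "fiction": 10,
    "scientific": 10,
    "Wikipedia": 7,
    "legal": 7,
    "parliamentary transcripts": 3,
    "subtitles": 3,
}


def genre_word_counts(total_words: int, shares: dict) -> dict:
    """Return the approximate number of words per genre."""
    if sum(shares.values()) != 100:
        raise ValueError("genre shares must add up to 100%")
    # Integer arithmetic avoids floating-point rounding.
    return {genre: total_words * pct // 100 for genre, pct in shares.items()}


estimates = genre_word_counts(TOTAL_WORDS, GENRE_SHARES)
for genre, words in estimates.items():
    print(f"{genre}: ~{words / 1_000_000:.1f}M words")
```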