Tīmeklis2020
CommonCrawl of Latvian 2020
The corpus was created within the CommonCrawl project. Latvian web pages published from 2013 to 2020 were collected. The selected texts are automatically morphologically annotated; for morphologically ambiguous forms, only the single most likely analysis is listed. Duplicate paragraphs have been removed from the original data, which should be taken into account when analyzing word frequencies.
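To illustrate why paragraph deduplication matters for frequency analysis, here is a minimal Python sketch. It is not the procedure used to build this corpus, which is not described here. The toy paragraphs, the whitespace tokenization, and the helper names `word_frequencies` and `deduplicate` are all assumptions made for illustration.

```python
from collections import Counter


def word_frequencies(paragraphs):
    """Count lowercased, whitespace-separated words across all paragraphs."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(paragraph.lower().split())
    return counts


def deduplicate(paragraphs):
    """Keep only the first occurrence of each paragraph, preserving order."""
    return list(dict.fromkeys(paragraphs))


# Hypothetical toy data: a footer line repeated on several pages.
paragraphs = [
    "laipni lūdzam mūsu lapā",
    "visas tiesības aizsargātas",
    "jaunumi par laikapstākļiem",
    "visas tiesības aizsargātas",
    "visas tiesības aizsargātas",
]

raw = word_frequencies(paragraphs)
deduped = word_frequencies(deduplicate(paragraphs))

# Words from repeated boilerplate are counted once after deduplication.
print(raw["tiesības"], deduped["tiesības"])
```

In this toy example, words from the repeated footer are counted only once after deduplication. Frequencies in a deduplicated corpus will likewise tend to be lower for words typical of repeated boilerplate than they would be in the raw web data.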
Corpus size | 403.6M words (492.6M tokens) |
Data period | 2013–2022 |
Development period | 2020–2022 |
Developers | Institute of Mathematics and Computer Science, University of Latvia |
Funding | State Research Programme "Digital Resources for Humanities" (VPP-IZM-DH-2020/1-0001) |