Tīmeklis2020

CommonCrawl of Latvian 2020

The corpus was created within the CommonCrawl project from Latvian web pages published between 2013 and 2020. The selected texts are morphologically annotated automatically; for morphologically ambiguous forms, only the single most likely analysis is given. Duplicate paragraphs have been removed from the original data, which should be taken into account when analyzing word frequencies.
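As a rough illustration of why paragraph deduplication matters for frequency analysis, the sketch below counts words in a tiny made-up sample with and without repeated paragraphs. The function name `word_frequencies`, the sample sentences, and the exact-match comparison are assumptions made for this example; the corpus description does not say how duplicates were actually detected.

```python
from collections import Counter


def word_frequencies(paragraphs, deduplicate=True):
    """Count lowercased, whitespace-separated words.

    When deduplicate is True, a paragraph that exactly repeats an earlier
    one (after stripping surrounding whitespace) is skipped.
    """
    seen = set()
    counts = Counter()
    for paragraph in paragraphs:
        key = paragraph.strip()
        if deduplicate:
            if key in seen:
                continue
            seen.add(key)
        counts.update(key.lower().split())
    return counts


# Hypothetical sample: the second paragraph stands in for boilerplate
# text that appears on more than one page.
sample = [
    "Rīga ir Latvijas galvaspilsēta.",
    "Sīkdatnes palīdz uzlabot vietni.",
    "Sīkdatnes palīdz uzlabot vietni.",
]

raw = word_frequencies(sample, deduplicate=False)
deduped = word_frequencies(sample, deduplicate=True)
print(raw["sīkdatnes"], deduped["sīkdatnes"])  # prints "2 1"
```

Words that occur mostly in repeated boilerplate therefore show lower counts in a deduplicated corpus than they would in the raw crawl, so frequencies taken from this corpus are not directly comparable with counts from non-deduplicated web data.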

Corpus size: 403.6M words (492.6M tokens)
Development period: 2020–2022
Developers: Institute of Mathematics and Computer Science UL
Funding: State Research Programme "Digital Resources for Humanities" (VPP-IZM-DH-2020/1-0001)