Tīmeklis2020

CommonCrawl of Latvian 2020

The corpus was created within the CommonCrawl project. Latvian web pages published from 2013 to 2020 were collected. The selected texts are morphologically annotated automatically; for morphologically ambiguous forms, only the single most likely analysis is listed. Duplicate paragraphs have been removed from the original data, which should be taken into account when analyzing word frequencies.

Tags: text, general, web, morphology

Corpus size	403.6M words (492.6M tokens)
Data period	2013–2022
Development period	2020–2022
Developers	Institute of Mathematics and Computer Science, University of Latvia (IMCS UL)
Funding	State Research Programme "Digital Resources for Humanities" (VPP-IZM-DH-2020/1-0001)