Corpora with tag "text" (25)

LVK2018

The Balanced Corpus of Modern Latvian

2016–2018, 10M words (12M tokens)
Developers: IMCS UL

MuLa2022

Corpus of Contemporary Latgalian Texts 2022

2020–2022, 2M words (2.8M tokens)
Developers: RAT, IMCS UL

LaVA

Latvian Language Learner Corpus

2018–2021, 192k words (241k tokens)
Developers: IMCS UL

LVTB

Latvian Treebank

2010–2022, 16 803 sentences (282 167 tokens)
Developers: IMCS UL

Barometrs

Corpus of News Portal Comments

2011–2022, 26M comments (642M tokens)
Developers: RSU, IMCS UL

Disertācijas

Corpus of Latvian PhD Theses

2022, 16.7M words (23.4M tokens)
Developers: IMCS UL

Emuāri

Latvian Blog Corpus 2015

2014–2015, 6.6M words (8M tokens)
Developers: IMCS UL

FullStack-LV

Full Stack of Latvian Language Resources

2017–2019, 13 691 sentences
Developers: IMCS UL

Hugo.lv

Hugo.lv Parallel Corpora

2018, 10.5M tokens
Developers: KISC

LatSenRom

Corpus of Latvian Early Novels

2019–2022, 4.6M words (5.8M tokens)
Developers: NLL, ILFA UL

Likumi

Corpus of Legal Acts of the Republic of Latvia

2022, 73.9M words (116.2M tokens)
Developers: IMCS UL

LiLa

Lithuanian-Latvian-Lithuanian Parallel Text Corpus

2011–2013, 8M words
Developers: IMCS UL, VMU

MuLa2012

Corpus of Contemporary Latgalian Texts 2012

2011–2013, 1M words (1.3M tokens)
Developers: IMCS UL, RAT

PanDi

Corpus of Latvian Pandemic Diaries

2020–2022, 565k words (709k tokens)
Developers: ILFA UL

Pārspriedumi

Corpus of Students' Essays

2018–2021, 185k words (226k tokens)
Developers: IMCS UL, LiepU, RAT

Rainis

Corpus of Texts Written by Rainis

2018, 1.6M words (2.3M tokens)
Developers: IMCS UL

Saeima

Corpus of the Saeima (the Parliament of Latvia)

2013–2018, 21M words (24M tokens)
Developers: IMCS UL, RSU

Senie

Corpus of Early Written Latvian Texts

2002–present, 2M words (2.7M tokens)
Developers: LLI UL, FH UL, IMCS UL

Tīmeklis2007

Latvian Web Corpus 2007

2006–2007, 99M words (123M tokens)
Developers: IMCS UL

Tīmeklis2020

CommonCrawl of Latvian 2020

2020–2022, 403.6M words (492.6M tokens)
Developers: IMCS UL

UDLV-LVTB

Latvian UD Treebank

2015–2022, 16 951 sentences (285 425 tokens)
Developers: IMCS UL

Vikipēdija

Latvian Wikipedia

2022, 17.9M words (27.7M tokens)
Developers: IMCS UL

VVPP

Corpus of the Tests of the State Language Proficiency Testing

2017–2018, 150k tokens
Developers: IMCS UL

Ziņas

Articles from Latvian News Portals

2022, 357.2M words (513.5M tokens)
Developers: IMCS UL

ĪsprozaS

Corpus of Latvian Women Writers’ Short Fiction

2020–2022, 925k words (1.2M tokens)
Developers: ILFA UL